Fast Concurrent Queues for x86 Processors
Adam Morrison, Yehuda Afek
Blavatnik School of


In building concurrent FIFO queues this reasoning has led researchers to propose combining-based concurrent queues. This paper takes a different approach, showing how to rely on fetch-and-add (F&A), a less powerful primitive that is available on x86 process...


Figure 1: Time to increment a contended counter on a system with four Intel Xeon E7-4870 (Westmere EX) processors, each of which has 10 2.40 GHz cores that multiplex 2 hardware threads. The right vertical axis shows the number of CAS it takes to complete an increment.

One of the CRQ's distinctive properties compared to prior concurrent circular array queues [2–4, 9, 10, 22, 23] is that in the common case an operation accesses only the CRQ's head or tail but not both. This reduces the CRQ's synchronization cost by a factor of two, since the contended head and tail are the algorithm's bottleneck.

2. Related work

We refer the reader to Michael and Scott's extensive survey [19] for discussion of additional work that predates theirs.

List based queues. Michael and Scott present two linked list queues, one nonblocking (henceforth MS queue) and one lock-based [19]. However, due to contention on the queue's head and tail, their algorithms do not scale past a low level of concurrency [7, 11]. Kogan and Petrank introduce a wait-free variant of the MS queue with similar performance characteristics [16]. Several works attempt to improve the MS queue's scalability, however all these still suffer from the CAS retry problem [15, 17, 20].

Cyclic array queues. Prior concurrent cyclic array queues are bounded and can contain a fixed number of items. One of the challenges in these algorithms is correctly determining when the queue is full and empty. The queues of Gottlieb et al. [10] and of Freudenthal and Gottlieb [9] maintain a size counter that is updated using F&A. Such a F&A might bring the queue into an inconsistent state (e.g., size < 0) and the algorithm then tries to recover using a compensating F&A. Still, the inconsistent states make these queues non-linearizable (see footnote 1). Blelloch et al. [3] use room synchronization, which prevents enqueues from running concurrently to dequeues, to construct a queue that is linearizable despite temporarily entering inconsistent states when its head/tail are updated using F&A. Another queue by Blelloch et al. [2] avoids inconsistent states of the head and tail by updating these indices using hardware memory block transactions which are not supported by commercial hardware. Tsigas and Zhang [23], Colvin and Groves [4] and Shafiei [22] present cyclic array queues that avoid inconsistent head and tail states by performing the
updates using CAS, but are therefore prone to the CAS failure effect.

In contrast to these prior designs, LCRQ is an unbounded queue formed by linking CRQs (array queues) in a list, with a new CRQ added when an enqueue operation fails to make progress. The ability to close a CRQ, forcing enqueues to move to the next CRQ in the list, makes LCRQ nonblocking whereas prior F&A-based designs [2, 3, 9, 10] are blocking. In addition, since we do not need to determine when the queue is full in a linearizable way, we can recover from inconsistent states that result from using F&A for head/tail updates without compromising linearizability. Performance-wise, a CRQ operation accesses only one end of the queue in the common case, whereas the operations in the previous designs access both the head and tail indices.

Footnote 1: Blelloch et al. [3] show a non-linearizable execution for Gottlieb et al.'s queue. A similar scenario applies to Freudenthal and Gottlieb's queue.

Combining. Researchers have recently shown that combining-based queues scale better than CAS-based list queues [7, 8, 11]. A combining algorithm is essentially a universal construction [12] that can implement any shared object. The idea is that a single thread scans a list of pending operations and applies them to the object. Such algorithms greatly reduce the synchronization cost of accessing the object, at the cost of executing work serially.

Hendler et al. describe a linked list queue based on flat combining, a lock-based combining construction [11]. Fatourou and Kallimanis present SimQueue [8], a queue based on a wait-free combining construction, and CC-Queue, a queue based on a blocking combining algorithm [7]. Section 5 details these algorithms.

Both of Fatourou and Kallimanis' algorithms use weak synchronization primitives (F&A and SWAP). However, they do so to reduce the synchronization cost of the combining algorithm, which still needs to perform serial work that is linear in the number of threads. In contrast, we use F&A to enable parallelism in the seemingly inherently sequential FIFO queue.

3. Preliminaries

System model. Most work on concurrent algorithms assumes a sequentially consistent shared memory system, particularly for correctness proofs, as this allows modeling the execution as a sequence of interleaved memory operations performed by the threads. While the x86 memory model is not sequentially consistent, the only difference is that on the x86 a write gets buffered in a write buffer before reaching the memory, allowing a read to be satisfied from memory before a write preceding it becomes globally visible [21]. However, in our algorithms threads write to shared data only with atomic operations, such as CAS and F&A. Atomic operations flush the write buffer and are globally ordered [21], allowing us to treat the system as sequentially consistent. Formally, we have a set of T sequential threads that communicate by performing operations on the shared memory, as described below.

Memory operations. The memory is an array of locations, each holding a 64-bit value. We use the notation m[a] for the value stored in address a of the memory. Our algorithms use the following primitives supported by the x86 architecture: (1) read(a), which returns m[a]; (2) fetch-and-add, denoted F&A(a, x), which returns v = m[a] and changes m[a]'s value to v + x; (3) swap, denoted SWAP(a, x), which returns v = m[a] and changes m[a]'s value to x; (4) test-and-set, denoted T&S(a), which returns v = m[a] and changes m[a]'s value to 1; (5) compare-and-swap, denoted CAS(a, o, n), which changes m[a]'s value to n if m[a] = o and returns TRUE, or returns FALSE otherwise; (6) compare-and-swap2, denoted CAS2(a, ⟨o0, o1⟩, ⟨n0, n1⟩), which changes m[a]'s value to n0 and m[a+1]'s value to n1 if m[a] = o0 and m[a+1] = o1 before returning TRUE, or else returns FALSE (see footnote 2).

Concurrent objects. The threads implement a high-level object defined by a sequential specification, a state machine specifying the object's states and the operations used to transition between the states. Here we are concerned with the FIFO queue, an object whose state, Q, is a (possibly empty) sequence of items. It supports an enqueue(x) operation that appends x to Q and returns OK, and a dequeue() operation which removes the first item x from Q and returns x, or returns EMPTY if Q is the empty sequence.

Implementations, executions and linearizability. We use the standard definitions of a high-level object implementation and its execution [14]. Our correctness condition is linearizability [14], which (informally) requires that a high-level operation appears to take place at one point in time during its execution interval.

Progress. According to Herlihy's now standard definition [12], an implementation is
nonblocking if it guarantees that some thread completes an operation in a finite number of steps. In other words, an individual operation may starve, but some operation always makes progress. This guarantee still allows some undesirable scenarios for queues, e.g., an execution in which enqueuers are starved by dequeuers returning EMPTY. Nonblocking queues in the literature [4, 19, 22, 23] actually provide a stronger guarantee, which we call op-wise nonblocking (see footnote 3): some enqueue() completes in a finite number of steps by enqueuing threads, and some dequeue() completes in a finite number of steps by dequeuing threads.

4. The LCRQ algorithm

LCRQ can be viewed as a practical realization of the following simple but unrealistic queue algorithm (Figure 2). The algorithm represents the queue using an infinite array, Q, with (unbounded) head and tail indices that identify the part of Q which may contain items. Initially, each cell Q[i] is empty and contains a reserved value ⊥ that may not be enqueued. An enqueue(x) operation obtains a cell index t via a F&A on tail. The enqueue then atomically swaps the value in Q[t] with x. If the swap returns ⊥, the enqueue operation completes; otherwise, it repeats this process. A dequeue, D, obtains a cell index h using F&A on head and atomically swaps the value in Q[h] with another reserved value ⊤. If Q[h] contained some x ≠ ⊥, D returns x. If D finds ⊥ in Q[h], the fact that D stored ⊤ in the cell guarantees that an enqueue operation which later stores an item in Q[h] will not complete. D then returns EMPTY if tail ≤ h + 1 (h + 1 is the value of head following D's F&A). If D cannot return EMPTY, it repeats this process.

Footnote 2: On the x86 these atomic primitives are known as LOCK XADD, XCHG, LOCK BTS, LOCK CMPXCHG and LOCK CMPXCHG16B.

Footnote 3: We are not aware of this property being explicitly pointed out before.
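The infinite array algorithm described above can be sketched with C11 atomics, where atomic_fetch_add plays the role of F&A and atomic_exchange the role of SWAP. This is only an illustration under stated assumptions: the array is truncated to a hypothetical CAPACITY cells (the paper's array is infinite), the reserved values ⊥ and ⊤ are encoded as 0 and UINTPTR_MAX, and the names iq_enqueue and iq_dequeue are invented for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumptions of this sketch (not from the paper): a finite CAPACITY stands
 * in for the infinite array, and the reserved values are encoded as below. */
#define CAPACITY 1024
#define BOTTOM ((uintptr_t)0)      /* reserved "empty cell" value            */
#define TOP    (UINTPTR_MAX)       /* reserved "a dequeuer was here" value   */
#define EMPTY  TOP                 /* returned by iq_dequeue on empty queue  */

static _Atomic uintptr_t Q[CAPACITY];   /* static storage: all cells BOTTOM  */
static _Atomic uint64_t head, tail;     /* both start at 0                   */

/* Returns 1 on success, 0 only when the finite stand-in array is used up. */
int iq_enqueue(uintptr_t x) {
    assert(x != BOTTOM && x != TOP);    /* reserved values may not be enqueued */
    for (;;) {
        uint64_t t = atomic_fetch_add(&tail, 1);            /* F&A(&tail, 1) */
        if (t >= CAPACITY) return 0;
        /* SWAP(&Q[t], x): success only if no dequeuer got to the cell first */
        if (atomic_exchange(&Q[t], x) == BOTTOM) return 1;
    }
}

uintptr_t iq_dequeue(void) {
    for (;;) {
        uint64_t h = atomic_fetch_add(&head, 1);            /* F&A(&head, 1) */
        if (h >= CAPACITY) return EMPTY;
        uintptr_t x = atomic_exchange(&Q[h], TOP);          /* SWAP(&Q[h], TOP) */
        if (x != BOTTOM) return x;                          /* found an item */
        if (atomic_load(&tail) <= h + 1) return EMPTY;      /* tail <= h + 1 */
    }
}
```

Note how a dequeue on an empty queue still consumes a cell: the next enqueue finds ⊤ there and has to retry on a fresh index. With dequeuers arriving continuously, this retry can repeat indefinitely, which is the livelock the text attributes to this algorithm.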
    enqueue(x : Object) {
      while (true) {
        t := F&A(&tail, 1)
        if (SWAP(&Q[t], x) = ⊥) return OK
      }
    }

    dequeue() {
      while (true) {
        h := F&A(&head, 1)
        x := SWAP(&Q[h], ⊤)
        if x ≠ ⊥ return x
        if (tail ≤ h + 1) return EMPTY
      }
    }

Figure 2: Infinite array queue.

While this algorithm is a linearizable FIFO queue (see footnote 4), it has two major flaws that prevent it from being relevant in practice: using an infinite array and susceptibility to livelock (a dequeuer continuously swaps ⊤ into the cell an enqueuer is about to access). We obtain the practical LCRQ algorithm by solving these problems.

Our array queue, CRQ, transforms the infinite array to a cyclic array (ring) of R nodes. The head and tail indices still strictly increase, but now the value of an index modulo R specifies the ring node it points to. Since now more than one enqueuer and dequeuer can concurrently access a node, we replace the infinite array queue's SWAP-based exchange with a CAS2-based protocol. This protocol is unique in that, unlike prior work [2, 10], an operation does not have to wait for the completion of operations whose F&A returns smaller indices that also point to the same ring node.

The CRQ's crucial performance property is that in the common fast path, an operation accesses only one F&A. We use the additional synchronization in the ring nodes to detect corner cases, such as an empty queue. Since head and tail are heavily contended, our approach halves an operation's synchronization cost in the common case. We detail the CRQ algorithm in Section 4.1.

The LCRQ algorithm (Section 4.2) builds on CRQ to prevent the livelock problem. We represent the queue as a linked list of CRQs. An enqueue(x) operation failing to make progress in the tail CRQ closes it to further enqueues. Upon noticing the tail CRQ is closed, each enqueuer tries to append a new CRQ, initialized to contain its item, to the list. One enqueuer succeeds and completes; the rest move into the new tail CRQ, leaving the old tail CRQ with only dequeuers inside it, which allows the dequeuers to complete. The LCRQ is thus op-wise nonblocking.

4.1 The CRQ algorithm

The pseudocode of the basic CRQ algorithm appears in Figure 3. The CRQ represents the queue as a ring (cyclic array) of R nodes, with 64-bit head and tail indices (Figure 3a). An index with value i points to node i mod R, which we denote by node(i). We reserve the most significant bit of tail to denote the CRQ's CLOSED
state. We thus make the realistic assumption that both head and tail do not exceed 2^63.

The synchronization protocol in a CRQ ring node needs to handle more cases than the infinite array queue, which only needs to distinguish whether an enqueue or dequeue arrives first at the node. We proceed to describe this protocol and how it handles these cases.

Node structure (Figure 3a). Physically, a ring node contains two 64-bit words. Logically, a ring node is a 3-tuple (s, i, v) consisting of (1) a safe bit s (used by a dequeuer to notify the matching enqueuer that storing an item in the node is unsafe as the dequeuer will not be around to dequeue it; we explain the details below), (2) an index i, and (3) a value v. Initially, node u's state is (1, u, ⊥) for every 0 ≤ u < R.

Footnote 4: We omit the full proof, which is similar to the proof in Section 4.1.2.

... infinite array, Q, coupled with indices head(Q) and tail(Q) representing Q's head and tail. (Note that Q is not cyclic.) Initially, tail(Q) = head(Q) = 0 and Q[i] = ⊥ for all i.

We process the execution one event at a time, in order of execution, but using information about future events to decide when to linearize an operation. When we linearize an operation we also apply it to the auxiliary queue. We linearize an ⟨enq(x) : OK⟩ on its final F&A, the one returning index t such that the operation enqueues x in node(t). At this point we also set tail(Q) to t + 1. We linearize the dequeue of item x = Q[h] as soon as the dequeue becomes active and h is the lowest indexed non-⊥ cell in Q, and set head(Q) to h + 1 at this point. We linearize a ⟨deq : EMPTY⟩ on its read of tail that returns a value ≤ head (we later show that head(Q) = tail(Q) at this point). The full pseudocode of P in Figure 4 also includes the straightforward cases of linearizing ⟨enq(x) : CLOSED⟩ operations.

By construction, the linearization point of an operation is within its execution interval, and all completed enqueues and all dequeues that return EMPTY are linearized. We now show that completed dequeues which do not return EMPTY are also linearized. Here we denote by enq_i(x) the ⟨enq(x) : OK⟩ operation whose last F&A on tail in E returns i, causing P to set Q[i] := x and linearize it. Similarly, we denote a dequeue operation whose last F&A on head in E returns i by deq_i.

Lemma 1. Suppose P linearizes enq_i(x). If there exists a dequeue operation deq that performs a F&A on head in E which returns i, then: (1) deq = deq_i (
i.e., deq performs no further F&As in E), (2) deq_i returns x if it completes, and (3) P linearizes ⟨deq_i : x⟩.

Proof. Let (s, j, ⊥) ↦ (1, i, x) be enq_i(x)'s enqueue transition storing x into u = node(i) (Figure 3d, Line 93). Notice that j ≤ i. If deq takes sufficiently many steps after obtaining i from its F&A on head, it performs a transition on u using index i. To see this, notice that deq moves on from u without performing a transition only if it reads an index > i from u (Figure 3b, Line 39). Because enq_i's transition succeeds, deq is the only operation that can move u's index beyond i, so this is impossible.

Now, consider deq's transition. It cannot be (·, k, ⊥) ↦ (·, i + R, ⊥) (Line 48) since that implies enq_i's transition fails. deq's transition also cannot be of the form (·, k, v) ↦ (0, k, v) (Line 45) because then, enq_i's transition succeeding implies that some enqueue (possibly enq_i) subsequently obtains index t ≥ i and then observes head ≤ t, which is impossible since head > i. Thus, deq's transition can only be a dequeue of x. Hence (1) and (2) hold.

We prove (3) using induction on k, the number of linearized enqueue operations. For k = 0 the claim is vacuously true. Suppose now that the k-th enqueue operation linearized is enq_i(x). If deq_i exists in E, then it does not complete before enq_i(x)'s F&A which returns i, since otherwise deq_i does not return x, contradicting (2). Therefore, there exists a first event e in which Q[i] = x and deq_i is active. Thus at some event e′, at or after e, Q[i] = x and deq_i has performed the F&A on head which returns i. Let idx = { j : j < i, Q[j] ≠ ⊥ at e′ }. For all j ∈ idx, deq_j starts by e′ (because deq_i's F&A has returned i) and does not complete before e′ (as that implies it is not linearized before completing, contradicting the induction hypothesis). Therefore, at e′ P linearizes deq_j for all j ∈ idx and subsequently linearizes deq_i.
To complete the linearizability proof, we must show that our linearization order meets the tantrum queue specification. Because we enqueue to Q's tail, dequeue from Q's head, and following the first enqueue to return CLOSED all enqueues do so, this amounts to showing that the auxiliary queue is empty when we linearize a ⟨deq : EMPTY⟩ operation. Lemma 2 below implies this, because we linearize a ⟨deq : EMPTY⟩ when it reads a value t from tail (Figure 3b, Line 53) such that t ≤ h + 1, where h < head is the value that the deq's prior F&A returns (Line 34).

    131  // shared variables on distinct cache lines:
    132  tail : pointer to CRQ
    133  head : pointer to CRQ
    134  // initially:
    135  tail = head = empty CRQ

    (a) Globals

    136  dequeue() {
    137    // local variables
    138    crq : pointer to CRQ
    139    v : 64 bit value
    140
    141    while (true) {
    142      crq := head
    143      v := dequeue(crq)
    144      if (v ≠ EMPTY) return v
    145      if (crq.next = null) return EMPTY
    146      v := dequeue(crq)
    147      if (v ≠ EMPTY) return v
    148      CAS(&head, crq, crq.next)
    149  } }

    (b) Dequeue

    150  enqueue(x : Object) {
    151    // local variables
    152    crq, newcrq : pointer to CRQ
    153
    154    while (true) {
    155      crq := tail
    156      if (crq.next ≠ null) {
    157        CAS(&tail, crq, crq.next)
    158        continue  // next iteration at Line 155
    159      }
    160      if (enqueue(crq, x) ≠ CLOSED)
    161        return OK
    162      newcrq := a new CRQ initialized to contain x
    163      if (CAS(&crq.next, null, newcrq)) {
    164        CAS(&tail, crq, newcrq)
    165        return OK
    166  } } }

    (c) Enqueue

Figure 5: Pseudocode of the LCRQ algorithm, using a linearizable CRQ black box.

Lemma 2. If at event e, head ≥ tail, then head(Q) = tail(Q).

Proof. Suppose towards a contradiction that head(Q) < tail(Q) at e. Then there exists a minimal i such that Q[i] ≠ ⊥ at e. Because we update tail(Q) following the order of F&As on tail, i < tail(Q) ≤ tail ≤ head at e. Thus, deq_i is active before e and should have been linearized by P, a contradiction.
In conclusion, we have shown the following.

Theorem 1. CRQ is a linearizable implementation of a tantrum queue.

4.2 The LCRQ algorithm

We now present LCRQ using the CRQ as a black box. The LCRQ is simply a linked list of CRQs in which dequeuing threads access the head CRQ and enqueuing threads access the tail CRQ (Figure 5a). An enqueue(x) operation that receives a CLOSED response from the tail CRQ creates a new CRQ, initialized to contain x, and links it after the current tail, thereby making it the new tail (Figure 5c). If the head CRQ becomes EMPTY and there is a node linked after it, dequeues move to the next node, after installing it as the new head (Figure 5b).

Memory reclamation. A dequeue that successfully changes the head pointer cannot reclaim the memory used by the old CRQ because there may be concurrent operations about to access it (i.e., stalled just before Line 143 or Line 160). We address this problem by using hazard pointers [18] to protect an operation's reference to the CRQ it is about to access. We omit the details, which are standard.

Linearizability. Assuming that the CRQ is a linearizable tantrum queue, proving that LCRQ is a linearizable queue implementation is straightforward:

Theorem 2. If CRQ is a linearizable tantrum queue implementation, then LCRQ is a linearizable queue implementation.

Proof. (Sketch) We linearize an enqueue that completes after appending a new CRQ to the list at the CAS which links the new CRQ (Figure 5c, Line 163). We linearize any other completed operation at the point in which its final CRQ operation takes place. The next pointer of a CRQ q changes from null only after q becomes CLOSED, and conversely, after a CRQ q becomes CLOSED no new enqueue completes until a new CRQ is linked after q. Thus, if q_0 precedes q_1 in the list, any q_1 enqueue is linearized after any q_0 enqueue. Similarly, any q_0 dequeue is linearized before any q_1 dequeue. Linearizability follows.
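The list-level logic of Figure 5 can be sketched in C11 as follows. Because the CRQ is treated as a black box in this section, the sketch substitutes a deliberately simple stand-in: a spinlock-protected ring of RING cells that closes permanently the first time an enqueue finds it full. That stand-in is an assumption made for illustration only and is not the paper's CAS2-based CRQ; the names crq_t, crq_new, lcrq_init, lcrq_enqueue and lcrq_dequeue are likewise hypothetical. Memory reclamation is left out, so drained CRQs are simply leaked rather than protected by hazard pointers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define RING  4              /* tiny ring so the list grows quickly (assumption) */
#define EMPTY UINT64_MAX     /* reserved return value; must never be enqueued    */

/* Stand-in tantrum queue, NOT the paper's CRQ: a locked ring that closes
 * for good the first time an enqueue finds it full. */
typedef struct crq {
    atomic_flag lock;
    uint64_t items[RING];
    unsigned head, tail;
    int closed;
    _Atomic(struct crq *) next;
} crq_t;

static crq_t *crq_new(void) {
    crq_t *q = calloc(1, sizeof *q);
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    atomic_init(&q->next, NULL);
    return q;
}

static int crq_enqueue(crq_t *q, uint64_t x) {    /* 1 = OK, 0 = CLOSED */
    while (atomic_flag_test_and_set(&q->lock)) { }
    if (q->tail - q->head == RING) q->closed = 1;
    int ok = !q->closed;
    if (ok) q->items[q->tail++ % RING] = x;
    atomic_flag_clear(&q->lock);
    return ok;
}

static uint64_t crq_dequeue(crq_t *q) {
    while (atomic_flag_test_and_set(&q->lock)) { }
    uint64_t v = (q->head == q->tail) ? EMPTY : q->items[q->head++ % RING];
    atomic_flag_clear(&q->lock);
    return v;
}

static _Atomic(crq_t *) lhead, ltail;             /* Figure 5a globals */

void lcrq_init(void) {
    crq_t *q = crq_new();
    atomic_store(&lhead, q);
    atomic_store(&ltail, q);
}

void lcrq_enqueue(uint64_t x) {                   /* follows Figure 5c */
    for (;;) {
        crq_t *crq = atomic_load(&ltail);
        crq_t *next = atomic_load(&crq->next);
        if (next != NULL) {                       /* tail lags: help advance it */
            atomic_compare_exchange_strong(&ltail, &crq, next);
            continue;
        }
        if (crq_enqueue(crq, x)) return;
        crq_t *newcrq = crq_new();                /* tail CRQ is CLOSED */
        crq_enqueue(newcrq, x);
        crq_t *expected = NULL;
        if (atomic_compare_exchange_strong(&crq->next, &expected, newcrq)) {
            atomic_compare_exchange_strong(&ltail, &crq, newcrq);
            return;
        }
        free(newcrq);                             /* lost the race; retry */
    }
}

uint64_t lcrq_dequeue(void) {                     /* follows Figure 5b */
    for (;;) {
        crq_t *crq = atomic_load(&lhead);
        uint64_t v = crq_dequeue(crq);
        if (v != EMPTY) return v;
        crq_t *next = atomic_load(&crq->next);
        if (next == NULL) return EMPTY;
        v = crq_dequeue(crq);                     /* second try, Lines 146-147 */
        if (v != EMPTY) return v;
        atomic_compare_exchange_strong(&lhead, &crq, next);
    }
}
```

The second crq_dequeue call mirrors Lines 146-147 of Figure 5: an item may have been enqueued into the head CRQ between the first EMPTY result and the check of the next pointer, so the CRQ is re-checked before the head moves past it.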
4.2.1 LCRQ nonblocking proof

In this section, we sketch the proof of the following theorem:

Theorem 3. LCRQ is op-wise nonblocking.

An enqueuer that does not complete within a finite number of steps in the tail CRQ closes it. Once the CRQ is closed, every enqueuer taking enough steps tries to append a new CRQ to the LCRQ. The first one to CAS the CRQ's next pointer (Figure 5c, Line 163) succeeds and completes. Thus, an enqueue operation completes within a finite number of steps by enqueuing threads.

Now, consider a dequeuer deq taking an infinite number of steps without completing. Suppose first that deq remains in one LCRQ node, q. If enqueuers take infinitely many steps in q, then q does not close and so, because q's size is finite, dequeuers remove items from q. If enqueuers take only finitely many steps in q, then from some point only dequeuers take steps in q and so eventually q's head exceeds its tail. Then deq finds that q is empty (Lines 53-54), enters fixState() but never leaves. Thus, new dequeuers continue to enter q and increment head. Since the number of dequeuers is finite, this implies some dequeuer completes.

The other possibility is that deq returns EMPTY in each CRQ node q_i it enters but never reaches the LCRQ's tail. Each node q_i contains at least one item, and so there is a dequeuer d_i that holds the index to this item. After traversing through T nodes, where T is the number of threads in the system, it must be that d_i = d_j for some j > i. This means d_i completes and returns. Overall, we have shown that a dequeue must complete within a finite number of steps by dequeuing threads.

5. Evaluation

Evaluated algorithms. We compare LCRQ to the best performing queues reported in the recent literature, all of which are based on the combining principle: Hendler et al.'s FC queue [11] and Fatourou and Kallimanis' CC-Queue and H-Queue [7]. We also test Michael and Scott's classic nonblocking MS queue [19].

The FC queue is based on flat combining, in which a thread becomes a combiner by acquiring a global lock, and then applies the operations of the non-combining threads. The queue we test is a linked list of cyclic arrays, with a new tail array allocated when the old tail fills.

The CC-Queue replaces each of the two locks in Michael and Scott's two-lock queue [19], which serialize accesses to the queue's head and tail, with an instance of the CC-Synch universal construction [7]. The CC-Synch universal construction maintains a linked list to which threads add themselves using SWAP. The thread at the head of the list traverses the list and performs the requests of waiting threads. Since the enqueue and dequeue CC-Synch instances work in parallel, the CC-Queue outperforms the FC queue [7].

The H-Queue is a hierarchical version of the CC-Queue. It uses an instance of the H-Synch universal construction [7] to replace the two-lock queue's locks. The H-Synch construction consists of one instance of CC-Synch per cluster and a lock that synchronizes the CC-Synch instances. Each CC-Synch combiner acquires the lock and performs the operations of the threads on its cluster.

To obtain the most meaningful results, we use the queue implementations from Fatourou and Kallimanis' benchmark framework [7, 8], all of which are in C (see footnote 5). We incorporate the FC queue implementation into this framework.

LCRQ implementation. We use CRQs whose ring size, R, is 2^17. (We include a sensitivity study of LCRQ to the ring size below.) In addition to baseline LCRQ, we also evaluate LCRQ+H, in which we enable our hierarchical optimization (with a timeout of 100 µs). To explore the impact of CAS failures, we test LCRQ-CAS, a version of LCRQ in which we implement the F&As using a CAS loop. All LCRQ variants include the overhead of pointing a hazard pointer at the CRQ before accessing it (see footnote 6).

Methodology. We follow the testing methodology of prior work [7, 19]. We measure the time it takes for every thread to execute 10^7 pairs of enqueue and dequeue operations, averaged over 10 runs. As in prior work, in every test we avoid artificial long run scenarios [19], in which a thread zooms through many consecutive operations, by having each thread wait for a random number of nanoseconds (up to 100) between operations. Each thread is pinned to a specific hardware thread, to avoid interference from the operating system scheduler. Our tests use the jemalloc [6] memory allocator to prevent memory allocation from being a bottleneck. The variance of the results is negligible (we use a dedicated test machine).

Platform. We use a Fujitsu PRIMERGY RX600 S6 server with four Intel Xeon E7-4870 (Westmere EX) processors, which were launched by Intel in early 2011. Each processor has 10 2.40 GHz cores, each of which multiplexes 2 hardware threads, so in total our system supports 80 hardware threads. Each core has private write-back L1 and L2 caches; an inclusive L3 cache is shared by all cores.

Single processor executions (Figure 6a). Here we restrict threads to run on one of the server's processors. This evaluates the queues in a modern multicore environment in which all synchronization is handled on-chip and thus has low cost. We omit results of LCRQ+H and H-Queue, since they are relevant only for multi-processor executions.

LCRQ outperforms all other queues beyond 2 threads. From 10 threads onwards, LCRQ outperforms CC-Queue by 1.5×, the FC queue by more than 2.5×, and the MS queue by more than 3×. LCRQ-CAS matches LCRQ's performance up to 4 threads, but at that point its performance levels off. Subsequently, LCRQ-CAS exhibits the throughput "meltdown" associated with highly contended hot spots. Its throughput at maximum concurrency is 33% lower than its peak performance at 8 threads. Similarly, MS queue's performance peaks at 2 threads and degrades as concurrency increases.

Table 2 explains the above results. LCRQ, LCRQ-CAS and the MS queue all complete in a few instructions, but some of these ...

Footnote 5: We fixed a memory leak bug in the CC and H-Queue implementations, thereby improving their performance.

Footnote 6: This consists of writing the CRQ's address to a thread-private location, issuing a memory fence, and rereading the LCRQ's head/tail.

Figure 7: Enqueue/dequeue throughput on four processors (threads run on all processors from the start).
Four processor execution (80 threads), queue initially empty:

                       LCRQ+H     LCRQ       LCRQ-CAS   H-Queue    CC-Queue
    Latency (µs)       2.19       6.20       13.50      3.28       9.70
    Instructions       1456.65    307.15     338.98     5670.17    16249.94
    Atomic operations  2          2          2.88       1.05       1
    L1 misses          4.12       2.91       4.15       9.99       10.70
    L2 misses          4.15       2.83       4.01       7.10       8.65
    L3 misses          0.51       1.47       2.23       0.34       5.90

Four processor execution (80 threads), queue initially full:

                       LCRQ+H     LCRQ       LCRQ-CAS   H-Queue    CC-Queue
    Latency (µs)       2.05       5.81       13.45      5.19       10.55
    Instructions       1515.60    278.62     293.86     9173.94    18224.62
    Atomic operations  2          2          2.95       1.05       1
    L1 misses          3.43       3.01       4.31       10.60      11.33
    L2 misses          3.54       2.90       4.17       7.74       9.07
    L3 misses          0.81       1.43       2.22       0.95       6.19

Table 3: Four processor average per-operation statistics.

... after the timeout expires. The spinning these operations do while waiting accounts for the increased average instruction count of LCRQ+H compared to LCRQ shown in Table 3.

In general, LCRQ operations have better latency than combining-based operations, which spend time either servicing other threads or waiting for the combiner. On a single processor, 42% of LCRQ operations finish in 0.24 µs while none of the combining operations do. On four processors, 80% of LCRQ operations finish in 9.6 µs compared to 50% of CC-Queue operations. Similarly, 80% of LCRQ+H operations finish in 0.5 µs compared to 30% of H-Queue operations.

Ring size sensitivity study (Figure 9). The ring size plays an important role in the performance of LCRQ. Intuitively, as the ring size decreases an LCRQ operation needs more tries before it succeeds in performing an enqueue/dequeue transition. To quantify this effect, we test LCRQ on an initially empty queue at maximum concurrency with various ring sizes.

On a single processor, taking R ≥ 32 is enough for LCRQ to outperform the CC-Queue by 1.33×. As R increases LCRQ's throughput increases up to 1.5× that of the CC-Queue. In other words, as long as an individual CRQ has room for all running threads, LCRQ obtains excellent performance. On the four processor benchmark the results are similar, but due to the higher concurrency level, LCRQ outperforms CC-Queue starting with R = 128 and the advantage becomes 1.5× starting with R = 1024. LCRQ+H requires R = 512 to match H-Queue and R = 4096 to outperform H-Queue by 1.5×.

6. Conclusion

We have presented LCRQ, a concurrent nonblocking linearizable FIFO queue that outperforms prior combining-based queue implementations by 1.5× to more than 2× in all concurrency levels on an x86 server with four multicore processors. LCRQ uses contended F&A objects to spread threads around items in the queue, allowing them to complete in parallel. Because the hardware guarantees that every F&A succeeds, we avoid the costly failures that plague CAS-based algorithms.

Our results show a couple of ways in which modern x86 multicore architecture requires reevaluating conventional wisdom about concurrent programming. First, LCRQ shows that on modern hardware an algorithm with a contended hot spot can scale quite well. Instead, it is CAS retries that are often the cause for notorious "contention meltdowns." Second, the conventional wisdom in the literature, of avoiding F&A or CAS2 since they are not widely supported, is outdated. We believe these principles can guide the design of future concurrent algorithms.

More practically, the LCRQ algorithm is simple to implement and offers excellent and robust performance on one of today's dominant multicore architectures. We therefore hope it gets adopted and used in practice.

Acknowledgments

Mike Dodds, Andreas Haas, and Christoph Kirsch, and Joe Israelevitz and Michael Scott discovered that the proceedings version of this paper (which did not include Lines 146-147 in Figure 5) could lose enqueued items. Vlad Roubtsov pointed out misprints in Figure 3. This work was supported by the Israel Science Foundation (grant 1386/11), by the Israeli Centers of Research Excellence (I-CORE) program (Center 4/11), and by Intel's lab support program.

Figure 8: Cumulative distribution of queue operation latency at maximum concurrency.
Figure 9: Impact of ring size on LCRQ throughput (CC-Queue and H-Queue results are shown for reference).

References

[1] Power ISA Version 2.06. http://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf, January 2009.
[2] G. E. Blelloch, P. B. Gibbons, and S. H. Vardhan. Combinable memory-block transactions. In SPAA 2008.
[3] G. E. Blelloch, P. Cheng, and P. B. Gibbons. Scalable room synchronizations. Theory of Computing Systems, 36, 2003.
[4] R. Colvin and L. Groves. Formal verification of an array-based nonblocking queue. In ICECCS 2005.
[5] D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing NUMA locks. In PPoPP 2012.
[6] J. Evans. Scalable memory allocation using jemalloc. http://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919, 2011.
[7] P. Fatourou and N. D. Kallimanis. Revisiting the combining synchronization technique. In PPoPP 2012.
[8] P. Fatourou and N. D. Kallimanis. A highly-efficient wait-free universal construction. In SPAA 2011.
[9] E. Freudenthal and A. Gottlieb. Process coordination with fetch-and-increment. In ASPLOS 1991.
[10] A. Gottlieb, B. D. Lubachevsky, and L. Rudolph. Basic techniques for the efficient coordination of very large numbers of cooperating sequential processors. TOPLAS, 5(2), Apr. 1983.
[11] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In SPAA 2010.
[12] M. Herlihy. Wait-free synchronization. TOPLAS, 13:124–149, January 1991.
[13] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[14] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. TOPLAS, 12:463–492, July 1990.
[15] M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In OPODIS 2007.
[16] A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In PPoPP 2011.
[17] E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In DISC 2004.
[18] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE TPDS, 15(6):491–504, June 2004.
[19] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In PODC 1996.
[20] M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using elimination to implement scalable and lock-free FIFO queues. In SPAA 2005.
[21] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-TSO: a rigorous and usable programmer's model for x86 multiprocessors. Communications of the ACM, 53(7):89–97, July 2010.