Uploaded On 2016-05-31

abort). In many cases this loss of opacity [14] is safe because of the transaction's sandboxing.

Software-assisted conflict management (SCM): To prevent the lemming effect in HLE transactions without resorting to lock removal, and without losing the opacity property, we propose a simple conflict management technique that allows the non-conflicting threads to continue their speculative HLE-based run without any interference from conflicting threads. To do this we add a serializing path to the lock implementation, in which an aborted thread has to acquire a distinct auxiliary lock (without using lock elision) in order to rejoin the speculative execution with the other threads. Using this approach, conflicting threads are serialized among themselves and do not interfere with other threads. Only if a thread repeatedly fails due to conflicts must it give up and acquire the original lock.

While SCM provides the most benefits when employed with HLE, the two schemes, SCM and SLR, can be combined to further reduce any progress problems caused when SLR threads give up and acquire the lock non-transactionally. Furthermore, to the best of our knowledge, SCM is the only scheme that enables HLE-based fair locks, with starvation freedom and progress guarantees and with no performance degradation.

We implemented our software-assisted methods in a library that uses the standard pthreads lock interface. This allows our methods to be used without changing or recompiling the program.

This paper's contributions are therefore:

- Analyzing the performance dynamics of Haswell's HLE and quantifying the impact of the lemming effect on it (Section 4).
- Introducing software-assisted lock removal (SLR), which regains the concurrency and speedup that were lost due to the HLE lemming effect (Section 5). SLR achieves higher levels of concurrency while sacrificing opacity.
- Introducing software-assisted conflict management (SCM), an alternative scheme that overcomes the lemming effect and works well with fair locks, such as MCS, ticket, or CLH locks (Section 6). SCM retains opacity while reaching slightly lower levels of concurrency (compared to SLR).
- Evaluating the two schemes, showing that they improve performance by up to 3.5 times in the STAMP application benchmarks and up to 10 times in data structure benchmarks as compared to using Haswell's HLE as is (Section 7).

2. RELATED WORK

Rajwar and Goodman [21] introduced the concept of speculative lock elision (SLE). They subsequently proposed transactional lock removal [22], which uses hardware-based conflict management to serialize conflicting transactions. Our approaches achieve a similar goal, but use software to assist the hardware implementation.

Dice et al. [12] studied transactional lock elision (TLE) using Sun's Rock processor and mentioned the lemming effect. In response, they sketch a non-backoff software mechanism to speed up recovery from the lemming effect. In contrast to their technique, our conflict management scheme prevents the problem in the first place and avoids the continuous zigzag between speculative and standard executions altogether.

Implementing elision-friendly locks using Intel's Haswell processor is discussed in [1]. However, Intel's optimization guidelines essentially turn fair locks into TTAS locks. This has two disadvantages: (1) wasting time when arriving while the lock is taken (as our experiments on STAMP show, this is significant), and (2) the lock no longer guarantees starvation-freedom and loses its fairness.
    shared variable:
      lock: 1 bit (boolean), initially FALSE

    lock() {
      while (TRUE) {
        while (lock = TRUE) {
          // busy wait
        }
        ret := XACQUIRE test&set(lock)
        if (ret = FALSE)
          return
      }
    }

    unlock() {
      XRELEASE lock := FALSE
    }

Figure 1: Applying hardware lock elision to a TTAS (Test&Test&Set) lock.

Roy et al. [23] and Afek et al. [5] implement lock elision completely in software, using specialized software transactional memory algorithms. These implementations instrument the critical sections to track read and written memory locations. In contrast, our algorithms are based on hardware TM and thus do not require such instrumentation.

We have sketched the software-assisted conflict management in a previous poster publication [3]. Concurrently to the current work, Calciu et al. [9] proposed using a mechanism similar to software-assisted lock removal as a fallback for transactional memory systems.

3. BACKGROUND: INTEL'S HASWELL HTM

Intel's transactional synchronization extensions (TSX) [2] define two interfaces to designate the scope of a transaction:

Hardware lock elision (HLE): In HLE, the scope of a lock-protected critical section defines a transaction's scope. HLE is implemented as a backward-compatible instruction set extension of two new prefixes, XACQUIRE and XRELEASE. Upon executing an XACQUIRE-prefixed instruction that writes to memory (e.g., a store or compare-and-swap) (see Figure 1), the processor starts a transaction and elides the actual store, treating it as a transactional read instead (i.e., placing the lock in the read set). Internally, however, the processor maintains an illusion that the lock was acquired: if the transaction reads the lock, it sees the value stored locally. Upon executing an XRELEASE store, the transaction commits. HLE requires that an XRELEASE store restores the lock to its original state; otherwise, it aborts the transaction. If an HLE transaction aborts, the XACQUIRE store is re-executed non-transactionally to acquire the lock in order to ensure progress. Notice that such a non-transactional store conflicts with every concurrent HLE transaction eliding the same lock, since such a transaction has the lock's cache line in its read set. This is the root cause of the lemming effect.

Restricted transactional memory (RTM): RTM is Haswell's generic TM interface with three new instructions: XBEGIN, XEND, and XABORT. XBEGIN begins a transaction, XEND commits, and XABORT allows a transaction to abort itself. Upon an abort, fall-back code pointed to by an operand of the XBEGIN instruction is executed; it can consult an abort status register in which the processor records the cause of the abort, whether due to an XABORT, a data conflict, or an “internal buffer overflow” [2]. RTM can be used to implement custom lock elision algorithms [1] by replacing the lock acquisition code with custom code that begins a transaction and reads from the lock's cache line. However, such a lock elision scheme fails to maintain the illusion that the thread wrote to the lock, as the lock's cache line is indistinguishable from any other line in the read set.

3.1 Haswell's TSX Implementation

Haswell appears to use a requestor-wins conflict management policy. A transaction aborts if either a coherency message (read or write) arrives for a cache line in its write set, or if an eviction due to a write arrives for a cache line in its read set.

Experiments we conducted (described in [4]) show that transactions are prone to spurious aborts that are not explained by data conflicts or read/write set overflow. Spurious aborts imply that even in a perfectly conflict-free workload, degradation such as the lemming effect, described below, is possible.

HLE compatible locks: Haswell's HLE mechanism conservatively requires that the store releasing the lock restores the lock to its original state prior to the acquisition [2]. Unfortunately, the popular (fair) ticket lock [18] (used in the Linux kernel [20]) and CLH lock [17, 11] do not meet this requirement. As an additional contribution, we adapt these locks for use under HLE, thus enabling HLE-based code to maintain the progress guarantees fair locks provide, and making HLE applicable to programs that use ticket locks or CLH locks.

We adjust both locks in a way that guarantees that a thread running alone (which is the illusion given by HLE) restores the lock to its original state when it releases the lock. The idea is that a thread releasing the lock first tries to optimistically restore the original state using a compare-and-swap instruction. If this fails, the thread reverts to using the standard lock algorithm. But if the CAS succeeds, the lock's state is restored, which is exactly what HLE requires. The algorithms appear in Appendix A and (due to space constraints) their correctness proofs appear in [4].

4. LEMMING EFFECT IN HASWELL HLE

In this section we experimentally quantify the serialization penalty due to transactional aborts during an HLE execution. We focus our analysis on the HLE-based test-and-test-and-set (TTAS) lock (Figure 1) and the fair HLE-based MCS [18] lock. We use the MCS lock as the representative of the class of fair locks because it is compatible with HLE, unlike other fair locks such as ticket locks or CLH locks. However, we have verified that both of these locks suffer from the same problems reported below for the MCS lock.

We use a red-black tree data structure protected by a single global lock. Varying the number of threads, the operation mix, and the tree size allows us to control the conflict level and the length and amount of data accessed in the critical section. A small tree and/or many mutating insert/delete threads result in higher conflict levels. Increasing the size of the tree reduces the chance that two operations' data accesses conflict, as the elements accessed are more sparsely distributed. For a given size, s, we initially fill the tree with random elements from a domain of size 2s. Then, we run for a period of 3 seconds in which each thread continuously performs random insert, delete and lookup operations, according to a specified distribution. (We use an equal rate of inserts and deletes so that on average the tree size does not change.)

Experiments were performed on a Core i7-4770 3.4 GHz Haswell processor, with 4 cores, each with 2 hyperthreads. Each core has private L1 and L2 caches, whose sizes are 32 KB and 256 KB respectively. There is also an 8 MB L3 cache shared by all cores. Each test point is the average of 10 runs (with little observed variance). We measured: (1) the total number of operations completed, (2) S, the number of successful speculative operations, (3) A, the number of aborted speculative operations and (4) N, the number of operations that complete via a non-speculative execution. The total number of operations performed is S+N. In some lock implementations an operation can start and abort several speculation attempts before completing, so there is no formula relating A to S and N.

Figure 2: Impact of aborts on executions under different lock implementations. For each tree size we show the average number of times a thread attempts to execute the critical section until successfully completing a tree operation, and the fraction of operations that complete non-speculatively. CLH and ticket results are omitted, as they are similar to the MCS lock results.

Figure 2 shows the amount of serialization caused by aborts, as a function of the tree size, for a moderate level of tree modifications (20%). In addition to the fraction of operations that complete non-speculatively (i.e., $N/(N+S)$), we report the amount of work required to complete an operation, i.e., $(A+N+S)/(N+S)$, the number of times a thread tries to complete the critical section before succeeding.

As Figure 2 shows, the serialization dynamics for each lock type are quite different. With an MCS lock, the benchmark executes virtually all operations non-speculatively after an initial speculative section aborts. As a result, an HLE MCS lock offers little if any speedup over a standard MCS lock, even when there is little underlying contention. The TTAS lock, on the other hand, manages to recover from aborts. At high conflict levels (on small trees) it requires 2 to 3.5 attempts to complete a single operation, but nevertheless 30% to 70% of the operations complete speculatively. As the tree size increases and conflict levels decrease, HLE shines and nearly all operations complete speculatively. We now turn to analyze the causes of these differences.

TTAS spinlock (the boxed line in Figure 2): The first thread to abort acquires the lock non-speculatively. As for the remaining threads, we distinguish between two behaviors. First, a thread that aborts because of this lock acquisition re-executes its acquiring TAS instruction, which returns 1 because the lock is held. The thread then spins, and once it observes the lock free re-issues its XACQUIRE TAS and re-enters a speculative execution. Second, a newly arriving thread initially observes the lock as taken and spins. Once the thread in the critical section releases the lock, the waiting thread issues an XACQUIRE TAS as in the first case. The bottom line is that all threads are blocked from entering a speculative execution until the initial aborted thread exits the critical section, but then all the threads resume execution speculatively. The flip side of this behavior is that a thread may thus abort several times before successfully completing its operation, either speculatively or non-speculatively.

Figure 3: Normalized throughput and serialization dynamics over time. We divide the execution into 1 millisecond time slots. Top: Throughput obtained in each time slot, normalized to the average throughput over the entire execution. Bottom: Fraction of operations that complete non-speculatively in each time slot. Panels: MCS lock (all operations complete non-speculatively); TTAS lock (most operations complete speculatively but there are periods of serialization).

Figure 4: The HLE speedup of 8 threads with different types of locks. The baseline of each speedup line is the standard version of that specific lock (the horizontal dotted black line at y=1). By mixing different access operations we vary the amount of contention: (i) lookups only – no contention, (ii) moderate contention – 10% of the tree accesses are insertions and 10% are deletions and (iii) extensive contention – all the accesses are insertions or deletions.

MCS fair lock (the circled line in Figure 2): The MCS lock represents the lock as a linked list of nodes, where each node represents a thread waiting to acquire the lock. An arriving thread uses an atomic SWAP [15] to atomically append its own node to the tail of the queue, and in the process retrieves a pointer to its predecessor in the queue. It then spins on the locked field of its node, waiting for its predecessor to set this field to false. In the case of the HLE-based MCS lock the lemming effect is much worse because all the threads that were aborted form a chain, each spinning on a different location waiting for the predecessor to release the lock. Neither an aborted thread nor a newly arriving thread can now enter the critical section speculatively. In either case the thread spins and, once its turn arrives, enters the critical section non-speculatively. Thus, a single abort causes the serialization of all concurrent critical sections, as well as newly arriving threads, all of which will now execute non-speculatively. Essentially, because of the fairness guarantees provided by the MCS lock, it “remembers” conflict events and makes it harder to resume a speculative execution. Even when the original lock holder releases the lock, it moves it into a state that does not allow new threads to speculatively execute. The MCS lock requires a quiescence period, in which no new threads arrive, so that all waiting threads acquire the lock, execute the critical section and leave. Only then does the MCS lock return to a state that allows the next arriving threads to execute speculatively.

Performance impact: In Figure 3 we divide the benchmark's execution into 1 millisecond time slots and show the throughput obtained in each slot, normalized to the throughput over the entire execution. We also show the fraction of operations that completed via a non-speculative execution in each time slot. As can be seen, TTAS performance can fluctuate severely, sometimes falling by as much as 2.5×. These throughput drops are correlated with periods in which more critical sections finish non-speculatively, i.e., after serialization caused by an abort. The MCS performance reinforces the results of the previous benchmark (Figure 2): the benchmark executes virtually all operations non-speculatively due to serialization caused by an abort. Finally, Figure 4 depicts the performance advantage of using lock elision with different types of locks.
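The two serialization metrics reported for Figure 2 follow directly from the counters S, A and N defined in this section. The short Python sketch below only illustrates that arithmetic; the function name and the counter values are hypothetical and are not measurements from the experiments.

```python
def serialization_metrics(S, A, N):
    """Return (fraction completed non-speculatively, attempts per operation).

    S: successful speculative operations
    A: aborted speculative attempts
    N: operations completed non-speculatively
    """
    completed = S + N  # total operations performed
    if completed == 0:
        raise ValueError("no completed operations")
    nonspec_fraction = N / completed           # N / (N + S)
    attempts_per_op = (A + N + S) / completed  # (A + N + S) / (N + S)
    return nonspec_fraction, attempts_per_op


# Hypothetical counters, chosen only to show the calculation.
frac, attempts = serialization_metrics(S=700, A=1200, N=300)
print(f"non-speculative fraction: {frac:.2f}")
print(f"attempts per operation:   {attempts:.2f}")
```

With these made-up counters, 30% of the operations complete non-speculatively and each operation takes 2.2 attempts on average. A run with no aborts and no non-speculative completions would give 0 and 1 respectively.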
Scenarios ending in abort:

    T1                      T2
    begin SLR txn
    load(X)
                            lock(L)
                            store(Y)
    load(Y)
    load(L)
    L is locked: ABORT

    T1                      T2
    begin SLR txn
    load(X)
                            lock(L)
                            store(Y)
                            store(X)
    ABORT: conflict on X

Successfully committing scenario:

    T1                      T2
    begin SLR txn
                            lock(L)
                            store(Y)
                            store(X)
    load(X)
                            unlock(L)
    load(Y)
    load(L)
    COMMIT

Figure 6: How SLR enforces correct executions in different scenarios.

avoid impact on the other speculatively running threads in the system. This scheme is compatible with any lock implementation, and resolves the lemming effect problem in HLE transactions without resorting to lock removal, and without sacrificing opacity. For example, when using lock removal in highly contended workloads with lots of aborts, new threads keep starting transactions and causing more aborts and wasted work. Here HLE's ability to grab the lock and stop newly arriving threads from entering speculative execution turns out to be extremely helpful and greatly improves performance (as shown in Section 7). Our conflict management scheme thus maintains this ability of HLE, and prevents the lemming effect in less contended scenarios.

Preventing livelock, the SCM scheme: Our scheme uses two locks, the original main lock, which is taken using the HLE/SLR mechanism, and an auxiliary standard lock, which is only acquired in a standard non-transactional manner. The auxiliary lock groups all the threads that are involved in a conflict and serializes them (see Figure 8). When a transaction is aborted, the aborted thread non-transactionally acquires the auxiliary lock and then rejoins the speculative execution of the original critical section. The process of acquiring the auxiliary lock in order to rejoin the speculative run is called the serializing path. As with previous schemes, the thread may retry its transaction before going to the serializing path.

To see why this scheme prevents livelock, consider two transactions, T1 and T2, which repeatedly abort each other. Once T1 acquires the auxiliary lock and rejoins the speculative execution, one of the following can happen: (1) T1 aborts again, but T2 commits, or (2) T2 aborts and thus tries to acquire the auxiliary lock, where it must wait for T1 to commit and release the auxiliary lock. Generalizing this, once a thread T acquires the auxiliary lock, any transaction that conflicts with T either commits or gets serialized to run after T. Thus the system makes progress.

Preventing starvation: In the above scheme starvation remains possible due to one of two scenarios: (1) a thread fails to acquire the auxiliary lock (as can happen with a TTAS lock), or (2) a thread holding the auxiliary lock fails to commit. To solve issue (1) we require that the auxiliary lock be a starvation-free (or “fair”) lock, such as an MCS lock. Our scheme then inherits any fairness properties of the auxiliary lock. To solve issue (2), the auxiliary lock holder non-transactionally acquires the main lock after failing to commit a given number of times. If all accesses to the main lock go through the HLE/SLR mechanism, then only the auxiliary lock holder can ever try to acquire the main lock and is therefore guaranteed to succeed. Otherwise (i.e., if the program sometimes explicitly acquires the lock non-transactionally), the main lock must be starvation-free as well.

While SCM provides the most benefits when employed with HLE, the two schemes, SCM and SLR, can be combined to further reduce any progress problems caused when SLR threads give up and acquire the lock non-transactionally.

     1  shared variables:
     2    main_lock: elided main lock
     3    aux_lock: auxiliary standard lock
     4
     5  thread local variables:
     6    retries: int
     7    aux_lock_owner: boolean, initially FALSE
     8
     9  lock() {
    10    retries := 0
    11    // primary path
    12    XBEGIN (Line 17)  // jump to Line 17 on abort
    13    call HLE or SLR lock() as appropriate
    14    return
    15
    16    // serializing path
    17    if (aux_lock_owner = TRUE) {
    18      retries++
    19    } else {
    20      aux_lock.lock()  // standard lock acquire
    21      aux_lock_owner := TRUE
    22    }
    23    if (retries < MAX_RETRIES)
    24      goto Line 12
    25    else
    26      main_lock.lock()  // standard lock acquire
    27  }
    28
    29  unlock() {
    30    if (XTEST()) {  // returns TRUE if the run is speculative
    31      call HLE or SLR unlock() as appropriate
    32      XEND
    33    } else {
    34      main_lock.unlock()  // standard lock release
    35    }
    36    if (aux_lock_owner = TRUE) {
    37      aux_lock.unlock()  // standard lock release
    38      aux_lock_owner := FALSE
    39    }
    40  }

Figure 7: Software-Assisted Conflict Management

Implementation and HLE compatibility (Figure 7): Our scheme maintains HLE-compatibility by nesting an HLE transaction within an RTM transaction. When used with HLE, we first start an RTM transaction which “acquires” the lock with an XACQUIRE store. Because TSX provides a flat nesting model [2], an abort will abort the parent RTM transaction and execute the fall-back code instead of re-issuing the XACQUIRE store and aborting all the running transactions.

[13] N. Diegues and P. Romano. Time-warp: Lightweight Abort Minimization in Transactional Memory. In PPoPP 2014.
[14] R. Guerraoui and M. Kapalka. On the correctness of transactional memory. In PPoPP 2008.
[15] M. Herlihy. Wait-free synchronization. ACM TOPLAS, 13:124–149, January 1991.
[16] M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA 1993.
[17] P. S. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In ISPP '94.
[18] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, Feb. 1991.
[19] D. Papagiannopoulou, G. Capodanno, R. I. Bahar, T. Moreshet, A. Holla, and M. Herlihy. Energy-Efficient and High-Performance Lock Speculation Hardware for Embedded Multicore Systems. In TRANSACT 2013.
[20] N. Piggin. x86: FIFO ticket spinlocks. http://lkml.org/lkml/2007/11/1/125, 2007.
[21] R. Rajwar and J. R. Goodman. Speculative Lock Elision: enabling highly concurrent multithreaded execution. In MICRO 2001.
[22] R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS 2002.
[23] A. Roy, S. Hand, and T. Harris. A runtime system for software lock elision. In EuroSys 2009.
[24] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In PACT 2012.

APPENDIX

A. ADJUSTING LOCKS TO WORK WITH HLE: DETAILS

Ticket Lock (Figure 12) Adjustments: The new implementation (Figure 13) handles both speculative and standard (non-speculative) runs. We use a compare-and-swap (CAS) primitive [15] in order to distinguish between the two cases: the release attempts to CAS the lock back to its original value, i.e., to decrement the next counter instead of incrementing the owner. If successful, it removes all traces of the lock acquisition; this occurs in either a speculative execution or a non-speculative single-thread execution. An unsuccessful CAS indicates a standard run with multiple requesters.

The only difference in the lock acquiring function is the XACQUIRE usage. In the adjusted unlock function (Figure 13), if Line 8 is successfully executed, either: (1) the lock is taken in a standard manner and the lock owner is the only running thread (no other requesters) or (2) the lock is taken in a speculative manner and the lock owner (which can be one of many) removes all traces of its run. Line 9 is used to release the lock when the lock is taken in a standard manner and the lock owner is not the only requester. This behavior is identical to the original implementation.

CLH Lock (Figure 14) Adjustments: The pseudo-code of the CLH lock implementation, adjusted to the HLE mechanism, is depicted in Figure 15. As in the ticket lock, we need to adjust the CLH lock so that the lock reverts to its original state when released in a solo run. Again, we use a CAS to do this, in an attempt to place pred at the tail of the queue, effectively erasing the presence of our node.

    shared variables:
      next: integer, initially 0
      owner: integer, initially 0

    lock() {
      current := F&A(&next, 1)
      while (owner ≠ current) {
        // busy wait
      }
    }

    unlock() {
      owner := owner + 1
    }

Figure 12: Ticket lock.

     1  lock() {
     2    current := XACQUIRE F&A(&next, 1)
     3    while (owner ≠ current) {
     4      // busy wait
     5  }}
     6
     7  unlock() {
     8    if (!XRELEASE CAS(&next, owner+1, owner)) {
     9      owner := owner + 1
    10  }}

Figure 13: Lock elision adjusted ticket lock.

    shared variable:
      struct Node {
        locked: 1 bit (boolean)
        next: pointer to Node
      }
      tail: pointer to Node, initially points to <FALSE, NULL>

    thread-local variables:
      myNode: pointer to Node, initially points to thread-local Node
      pred: pointer to Node

    lock() {
      myNode.locked := TRUE
      pred := SWAP(&tail, myNode)
      while (pred.locked) {
        // busy wait
      }
    }

    unlock() {
      myNode.locked := FALSE
      myNode := pred
    }

Figure 14: CLH lock.
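The release rule of the adjusted ticket lock (Figure 13) can be traced with a small single-threaded model. This Python sketch is an illustration only: it models the state transitions of the next and owner counters, it does not use the XACQUIRE/XRELEASE prefixes or any real atomics, and the class and method names are hypothetical.

```python
class TicketLockModel:
    """Single-threaded model of the adjusted ticket lock's counters."""

    def __init__(self):
        self.next = 0
        self.owner = 0

    def take_ticket(self):
        # Models F&A(&next, 1): return the old value, then increment.
        ticket = self.next
        self.next += 1
        return ticket

    def may_enter(self, ticket):
        # A requester enters once owner equals its ticket.
        return self.owner == ticket

    def _cas_next(self, expected, new):
        # Models an atomic compare-and-swap on `next`.
        if self.next == expected:
            self.next = new
            return True
        return False

    def release(self):
        # First try to undo the acquisition (next: owner+1 -> owner).
        # If other requesters took tickets meanwhile, fall back to the
        # standard release and pass the lock on.
        if not self._cas_next(self.owner + 1, self.owner):
            self.owner += 1


# Solo run: the release restores the state seen before the acquisition.
solo = TicketLockModel()
ticket = solo.take_ticket()
solo_entered = solo.may_enter(ticket)
solo.release()
solo_state = (solo.next, solo.owner)

# Two requesters: the first release must hand the lock over instead.
shared = TicketLockModel()
t0 = shared.take_ticket()
t1 = shared.take_ticket()
shared.release()                      # CAS fails, owner is incremented
handed_over = shared.may_enter(t1)
shared.release()                      # t1 is now alone, CAS succeeds
shared_state = (shared.next, shared.owner)
```

In the solo run the counters return to (0, 0), which is the property HLE requires of the releasing store. In the contended run the first release falls back to incrementing owner, and the last requester's release leaves the lock free at (1, 1), the state it observed before taking its ticket.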
    lock() {
      myNode.locked := TRUE
      pred := XACQUIRE SWAP(&tail, myNode)
      while (pred.locked) {
        // busy wait
      }
    }

    unlock() {
      if (!XRELEASE CAS(&tail, myNode, pred)) {
        myNode.locked := FALSE
        myNode := pred
      }
    }

Figure 15: Lock elision adjusted CLH lock.
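The same kind of trace can be made for the adjusted CLH lock of Figure 15. The Python sketch below is again a single-threaded model under stated assumptions: the SWAP and CAS on tail are modeled as plain assignments, the HLE prefixes are omitted, and all class and method names are hypothetical.

```python
class Node:
    def __init__(self, locked=False):
        self.locked = locked


class CLHLockModel:
    def __init__(self):
        # tail initially points to an unlocked dummy node
        self.tail = Node(locked=False)


class CLHThreadModel:
    """Per-thread state (myNode, pred) for the adjusted CLH lock."""

    def __init__(self, lock):
        self.lock = lock
        self.my_node = Node()
        self.pred = None

    def enqueue(self):
        self.my_node.locked = True
        # Models pred := SWAP(&tail, myNode)
        self.pred = self.lock.tail
        self.lock.tail = self.my_node

    def may_enter(self):
        # The thread enters once its predecessor has released.
        return not self.pred.locked

    def release(self):
        # Models CAS(&tail, myNode, pred): if nobody queued behind us,
        # put pred back at the tail and erase our node from the queue.
        if self.lock.tail is self.my_node:
            self.lock.tail = self.pred
        else:
            # Standard CLH release: signal the successor, recycle pred.
            self.my_node.locked = False
            self.my_node = self.pred


# Solo run: tail points to the original node again after release.
lock = CLHLockModel()
original_tail = lock.tail
solo = CLHThreadModel(lock)
solo.enqueue()
solo_entered = solo.may_enter()
solo.release()
tail_restored = lock.tail is original_tail

# Two requesters: the first release must signal the second thread.
lock2 = CLHLockModel()
a, b = CLHThreadModel(lock2), CLHThreadModel(lock2)
a.enqueue()
b.enqueue()
b_blocked = not b.may_enter()
a.release()                  # tail is b's node, so the CAS fails
b_released = b.may_enter()
```

In the solo run the tail pointer is restored to the node it referenced before the acquisition, so the releasing store leaves no trace. With a second requester queued, the CAS fails and the release follows the standard CLH path.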