Transactional Locking II

Dave Dice¹, Ori Shalev²,¹, and Nir Shavit¹
¹ Sun Microsystems Laboratories, 1 Network Drive, Burlington MA 01803-0903, {dice,shanir}@sun.com
² Tel-Aviv University, Tel-Aviv 69978, Israel, orish@post.tau.ac.il

Abstract. The transactional memory programming paradigm is gaining momentum as the approach of choice for replacing locks in concurrent programming. This paper introduces the transactional locking II (TL2) algorithm, a software transactional memory (STM) algorithm based on a combination of commit-time locking and a novel global version-clock based validation technique. TL2 improves on state-of-the-art STMs in the following ways: (1) unlike all other STMs it fits seamlessly with any system's memory life-cycle, including those using malloc/free, (2) unlike all other lock-based STMs it efficiently avoids periods of unsafe execution, that is, using its novel version-clock validation, user code is guaranteed to operate only on consistent memory states, and (3) in a sequence of high performance benchmarks, while providing these new properties, it delivered overall performance comparable to (and in many cases better than) that of all former STM algorithms, both lock-based and non-blocking. Perhaps more importantly, on various benchmarks, TL2 delivers performance that is competitive with the best hand-crafted fine-grained concurrent structures. Specifically, it is ten-fold faster than a single lock. We believe these characteristics make TL2 a viable candidate for deployment of transactional memory today, long before hardware transactional support is available.

1 Introduction

A goal of current multiprocessor software design is to introduce parallelism into software applications by allowing operations that do not conflict in accessing memory to proceed concurrently. The key tool in designing concurrent data structures has been the use of locks. Coarse-grained locking is easy to program, but unfortunately provides very poor performance because of limited parallelism. Fine-grained lock-based concurrent data structures perform exceptionally well, but designing them has long been recognized as a difficult task better left to experts. If concurrent programming is to become ubiquitous, researchers agree that alternative approaches that simplify code design and verification must be developed.

This paper is interested in "mechanical" methods for transforming sequential code or coarse-grained lock-based code into concurrent code. By mechanical we mean that the transformation, whether done by hand, by a preprocessor, or by a compiler, does not require any program specific information (such as the programmer's understanding of the data flow relationships). Moreover, we wish to focus on techniques that can be deployed to deliver reasonable performance across a wide range of systems today, yet combine easily with specialized hardware support as it becomes available.

1.1 Transactional Programming

The transactional memory programming paradigm of Herlihy and Moss [1] is gaining momentum as the approach of choice for replacing locks in concurrent programming. Combining sequences of concurrent operations into atomic transactions seems to promise a great reduction in the complexity of both programming and verification, by making parts of the code appear to be sequential without the need to program fine-grained locks. Transactions will hopefully remove from the programmer the burden of figuring out the interaction among concurrent operations that happen to conflict with each other. Non-conflicting transactions will run uninterrupted in parallel, and those that do will be aborted and retried without the programmer having to worry about issues such as deadlock.

There are currently proposals for hardware implementations of transactional memory (HTM) [1-4], purely software based ones, i.e. software transactional memories (STM) [5-13], and hybrid schemes (HyTM) that combine hardware and software [14, 10].³

³ A broad survey of prior art can be found in [6, 15, 16].

The dominant trend among transactional memory designs seems to be that the transactions provided to the programmer, in either hardware or software, should be "large scale", that is, unbounded, and dynamic. Unbounded means that there is no limit on the number of locations accessed by the transaction. Dynamic (as opposed to static) means that the set of locations accessed by the transaction is not known in advance and is determined during its execution.

Providing large scale transactions in hardware tends to introduce large degrees of complexity into the design [1-4]. Providing them efficiently in software is a difficult task, and there seem to be numerous design parameters and approaches in the literature [5-11], as well as requirements to combine well with hardware transactions once those become available [14, 10].

1.2 Lock-Based Software Transactional Memory

STM design has come a long way since the first STM algorithm by Shavit and Touitou [12], which provided a non-blocking implementation of static transactions (see [5-8, 15, 9-13]). A recent paper by Ennals [5] suggested that on modern operating systems deadlock prevention is the only compelling reason for making transactions non-blocking, and that there is no reason to provide it for transactions at the user level. We second this claim, noting that mechanisms already exist whereby threads might yield their quanta to other threads and that Solaris' schedctl allows threads to transiently defer preemption while holding locks.

Ennals [5] proposed an all-software lock-based implementation of software transactional memory using the object-based approach of [17]. His idea was to run through the transaction possibly operating on an inconsistent memory state, acquiring write locks as locations to be written are encountered, writing the new values in place and having pointers to an undo set that is not shared with other threads. The use of locks eliminates the need for indirection and shared transaction records as in the non-blocking STMs; it still requires, however, a closed memory system. Deadlocks and livelocks are dealt with using timeouts and the ability of transactions to request other transactions to abort. Another recent paper by Saha et al. [11] uses a version of Ennals' lock-based algorithm within a run-time system. The scheme described by Saha et al. acquires locks as they are encountered, but also keeps shared undo sets to allow transactions to actively abort others.

A workshop presentation by two of the authors [18] shows that lock-based STMs tend to outperform non-blocking ones due to simpler algorithms that result in lower overheads. However, two limitations remain, limitations that must be overcome if STMs are to be commercially deployed:

Closed Memory Systems. Memory used transactionally must be recyclable to be used non-transactionally and vice versa. This is relatively easy in garbage collected languages, but must also be supported in languages like C with standard malloc() and free() operations. Unfortunately, all non-blocking STM designs require closed memory systems, and the lock-based STMs [5, 11] either use closed systems or require specialized malloc() and free() operations.

Specialized Managed Runtime Environments. Current efficient STMs [5, 11] require special environments capable of containing irregular effects in order to avoid unsafe behavior resulting from their operating on inconsistent states.

The TL2 algorithm presented in this paper is the first STM that overcomes both of these limitations: it works with an open memory system, essentially with any type of malloc() and free(), and it runs user code only on consistent states, eliminating the need for specialized managed runtime environments.⁴

⁴ The TL algorithm [18], a precursor of TL2, works with an open memory system but runs on inconsistent states.

1.3 Vulnerabilities of STMs

Let us explain the above vulnerabilities in more detail. Current efficient STM implementations [18, 17, 5, 11] require closed memory systems as well as managed runtime environments capable of containing irregular effects. These closed systems and managed environments are necessary for efficient execution. Within these environments, they allow the execution of "zombies": transactions that have observed an inconsistent read-set but have yet to abort. The reliance on an accumulated read-set that is not a valid snapshot [19] of the shared memory locations accessed can cause unexpected behavior such as infinite loops, illegal memory accesses, and other run-time misbehavior.

The specialized runtime environment absorbs traps, converting them to transaction retries. Handling infinite loops in zombies is usually done by validating transactions while in progress. Validating the read-set on every transactional load would guarantee safety, but would also significantly impact performance. Another option is to perform periodic validations, for example, once every number of transactional loads or when looping in the user code [11]. Ennals [5] attempts to detect infinite loops by having every n-th transactional object "open" operation validate part of the accumulated read-set. Unfortunately, this policy admits infinite loops (as it is possible for a transaction to read less than n inconsistent memory locations and cause the thread to enter an infinite loop containing no subsequent transactional loads). In general, infinite loop detection mechanisms require extending the compiler or translator to insert validation checks into potential loops.

The second issue with existing STM implementations is their need for a closed memory allocation system. For type-safe garbage collected managed runtime environments such as that of the Java programming language, the collector assures that transactionally accessed memory will only be released once no references remain to the object. However, in C or C++, an object may be freed and depart the transactional space while concurrently executing threads continue to access it. The object's associated lock, if used properly, can offer a way around this problem, allowing memory to be recycled using standard malloc/free style operations. The recycled locations might still be read by a concurrent transaction, but will never be written by one.

1.4 Our New Results

This paper introduces the transactional locking II (TL2) algorithm. TL2 overcomes the drawbacks of all state-of-the-art lock-based algorithms, including our earlier TL algorithm [18]. The new idea in our new TL2 algorithm is to have, perhaps counter-intuitively, a global version-clock that is incremented once by each transaction that writes to memory, and is read by all transactions. We show how this shared clock can be constructed so that for all but the shortest transactions, the effects of contention are minimal. We note that the technique of time-stamping transactions is well known in the database community [20]. A global-clock based STM is also proposed by Riegel et al. [21]. Our global-clock based algorithm differs from the database work in that it is tailored to be highly efficient as required by small STM transactions as opposed to large database ones. It differs from the "snapshot isolation" algorithm of Riegel et al. as TL2 is lock-based and very simple, while Riegel et al. is non-blocking but costly as it uses time-stamps to choose between multiple concurrent copies of a transaction based on their associated execution intervals.

In TL2, all memory locations are augmented with a lock that contains a version number. Transactions start by reading the global version-clock and validating every location read against this clock. As we prove, this allows us to guarantee at a very low cost that only consistent memory views are ever read. Writing transactions need to collect a read-set but read-only ones do not. Once read- and write-sets are collected, transactions acquire locks on locations to be written, increment the global version-clock and attempt to commit by validating the read-set. Once committed, transactions update the memory locations with the new global version-clock value and release the associated locks.

We believe TL2 is revolutionary in that it overcomes most of the safety and performance issues that have plagued high-performance lock-based STM implementations:

- Unlike all former lock-based STMs it efficiently avoids vulnerabilities related to reading inconsistent memory states, not to mention the fact that former lock-based STMs must use compiler assist or manual programmer intervention to perform validity tests in user code to try and avoid as many of these zombie behaviors as possible. The need to overcome these safety vulnerabilities will be a major factor when going from experimental algorithms to actual production quality STMs. Moreover, as Saha et al. [11] explain, validation introduced to limit the effects of these safety issues can have a significant impact on overall STM performance.
- Unlike any former STM, TL2 allows transactional memory to be recycled into non-transactional memory and back using malloc and free style operations. This is done seamlessly and with no added complexity.
- As we show in Section 3, rather encouragingly, concurrent red-black trees derived in a mechanical fashion from sequential code using the TL2 algorithm and providing the above software engineering benefits, tend to perform as well as prior algorithms, exhibiting performance comparable to that of hand-crafted fine-grained lock-based algorithms. Overall TL2 is an order of magnitude faster than sequential code made concurrent using a single lock.

In summary, TL2's superior performance together with the fact that it combines seamlessly with hardware transactions and with any system's memory life-cycle, make it an ideal candidate for multi-language deployment today, long before hardware transactional support becomes commonly available.

2 Transactional Locking II

The TL2 algorithm we describe here is a global version-clock based variant of the transactional locking algorithm of Dice and Shavit (TL) [18]. As we will explain, based on this global versioning approach, and in contrast with prior local versioning approaches, we are able to eliminate several key safety issues afflicting other lock-based STM systems and simplify the process of mechanical code transformation. In addition, the use of global versioning will hopefully improve the performance of read-only transactions.

Our TL2 algorithm is a two-phase locking scheme that employs commit-time lock acquisition mode like the TL algorithm, differing from encounter-time algorithms such as those by Ennals [5] and Saha et al. [11].
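The overview given in Section 1.4 (sample the clock, validate each read against it, lock the write-set at commit time, increment the clock, validate the read-set, write back and release) can be sketched in a few lines. The following is an illustrative single-process model only, not the authors' implementation: the names AtomicWord, TVar, Tx, Abort and atomically are hypothetical, a Python mutex stands in for the hardware CAS, every location carries its own lock word, and lock acquisition makes a single attempt rather than spinning.

```python
import threading

class AtomicWord:
    """Stand-in for a machine word; a mutex emulates hardware CAS (assumption)."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()
    def load(self):
        return self._value
    def store(self, value):
        self._value = value
    def cas(self, expected, new):
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True

GLOBAL_CLOCK = AtomicWord(0)        # the global version-clock

class TVar:
    """A transactional location; lock word = (version << 1) | lock bit."""
    def __init__(self, value):
        self.value = value
        self.lock = AtomicWord(0)

class Abort(Exception):
    pass

class Tx:
    def __init__(self):
        self.rv = GLOBAL_CLOCK.load()   # read-version, sampled at start
        self.read_set = []
        self.write_set = {}             # TVar -> pending value

    def read(self, tvar):
        if tvar in self.write_set:      # return our own pending write
            return self.write_set[tvar]
        pre = tvar.lock.load()
        value = tvar.value
        post = tvar.lock.load()
        # post-validation: lock unchanged, free, and version <= rv
        if pre != post or (post & 1) or (post >> 1) > self.rv:
            raise Abort()
        self.read_set.append(tvar)
        return value

    def write(self, tvar, value):
        self.write_set[tvar] = value    # speculative: memory is untouched

    def commit(self):
        if not self.write_set:
            return                      # nothing to publish
        acquired = []
        try:
            for tvar in self.write_set:             # lock the write-set
                word = tvar.lock.load()
                if (word & 1) or not tvar.lock.cas(word, word | 1):
                    raise Abort()                   # single attempt, no spinning
                acquired.append((tvar, word))
            while True:                             # increment-and-fetch via CAS
                now = GLOBAL_CLOCK.load()
                if GLOBAL_CLOCK.cas(now, now + 1):
                    wv = now + 1
                    break
            if wv != self.rv + 1:                   # otherwise nobody else committed
                for tvar in self.read_set:          # validate the read-set
                    word = tvar.lock.load()
                    foreign_lock = (word & 1) and tvar not in self.write_set
                    if foreign_lock or (word >> 1) > self.rv:
                        raise Abort()
        except Abort:
            for tvar, old_word in acquired:         # restore old lock words
                tvar.lock.store(old_word)
            raise
        for tvar, value in self.write_set.items():  # write back, then release
            tvar.value = value
            tvar.lock.store(wv << 1)                # version = wv, lock bit clear

def atomically(body):
    """Run body(tx) and retry from scratch whenever the transaction aborts."""
    while True:
        tx = Tx()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Abort:
            continue
```

A transfer between two locations would then be written as atomically(lambda tx: (tx.write(a, tx.read(a) - 5), tx.write(b, tx.read(b) + 5))). Keeping the lock bit in the lowest bit of the version word is one convenient encoding; the paper only requires that a single bit mark the lock as taken.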
For each implemented transactional system (i.e. per application or data structure) we have a shared global version-clock variable. We describe it below using an implementation in which the counter is incremented using an increment-and-fetch implemented with a compare-and-swap (CAS) operation. Alternative implementations exist, however, that offer improved performance. The global version-clock will be read and incremented by each writing transaction and will be read by every read-only transaction.

We associate a special versioned write-lock with every transacted memory location. In its simplest form, the versioned write-lock is a single word spinlock that uses a CAS operation to acquire the lock and a store to release it. Since one only needs a single bit to indicate that the lock is taken, we use the rest of the lock word to hold a version number. This number is advanced by every successful lock-release. Unlike the TL algorithm or Ennals [5] and Saha et al. [11], in TL2 the new value written into each versioned write-lock location will be based on the shared global version-clock variable, a property which will provide us with several performance and correctness benefits.

To implement a given data structure we allocate a collection of versioned write-locks. We can use various schemes for associating locks with shared data: per object (PO), where a lock is assigned per shared object, or per stripe (PS), where we allocate a separate large array of locks and memory is striped (partitioned) using some hash function to map each transactable location to a stripe. Other mappings between transactional shared variables and locks are possible. The PO scheme requires either manual or compiler-assisted automatic insertion of lock fields whereas PS can be used with unmodified data structures. PO might be implemented, for instance, by leveraging the header words of objects in the Java programming language [22, 23]. A single PS stripe-lock array may be shared and used for different TL2 data structures within a single address-space. For instance an application with two distinct TL2 red-black trees and three TL2 hash-tables could use a single PS array for all TL2 locks. As our default mapping we chose an array of 2^20 entries of 32-bit lock words with the mapping function masking the variable address with "0x3FFFFC" and then adding in the base address of the lock array to derive the lock address.

In the following we describe the PS version of the TL2 algorithm although most of the details carry through verbatim for PO as well. We maintain thread local read- and write-sets as linked lists. Each read-set entry contains the address of the lock that "covers" the variable being read, and unlike former algorithms, does not need to contain the observed version number of the lock. The write-set entries contain the address of the variable, the value to be written to the variable, and the address of its associated lock. In many cases the lock and location address are related and so we need to keep only one of them in the read-set. The write-set is kept in chronological order to avoid write-after-write hazards.

2.1 The Basic TL2 Algorithm

We now describe how TL2 executes a sequential code fragment that was placed within a TL2 transaction. As we explain, TL2 does not require traps or the insertion of validation tests within user code, and in this mode does not require type-stable garbage collection, working seamlessly with the memory life-cycle of languages like C and C++.

Write Transactions. The following sequence of operations is performed by a writing transaction, one that performs writes to the shared memory. We will assume that a transaction is a writing transaction. If it is a read-only transaction this can be denoted by the programmer, determined at compile time or heuristically at runtime.

1. Sample global version-clock: Load the current value of the global version clock and store it in a thread local variable called the read-version number (rv). This value is later used for detection of recent changes to data fields by comparing it to the version fields of their versioned write-locks.
2. Run through a speculative execution: Execute the transaction code (load and store instructions are mechanically augmented and replaced so that speculative execution does not change the shared memory's state, hence the term "speculative"). Locally maintain a read-set of addresses loaded and a write-set of address/value pairs stored. This logging functionality is implemented by augmenting loads with instructions that record the read address and replacing stores with code recording the address and value to be written. The transactional load first checks (using a Bloom filter [24]) to see if the load address already appears in the write-set. If so, the transactional load returns the last value written to the address. This provides the illusion of processor consistency and avoids read-after-write hazards. A load instruction sampling the associated lock is inserted before each original load, which is then followed by post-validation code checking that the location's versioned write-lock is free and has not changed. Additionally, we make sure that the lock's version field is ≤ rv and the lock bit is clear. If it is greater than rv it suggests that the memory location has been modified after the current thread performed step 1, and the transaction is aborted.
3. Lock the write-set: Acquire the locks in any convenient order using bounded spinning to avoid indefinite deadlock. In case not all of these locks are successfully acquired, the transaction fails.
4. Increment global version-clock: Upon successful completion of lock acquisition of all locks in the write-set perform an increment-and-fetch (using a CAS operation for example) of the global version-clock, recording the returned value in a local write-version number variable wv.
5. Validate the read-set: Validate for each location in the read-set that the version number associated with the versioned write-lock is ≤ rv. We also verify that these memory locations have not been locked by other threads. In case the validation fails, the transaction is aborted. By re-validating the read-set, we guarantee that its memory locations have not been modified while steps 3 and 4 were being executed. In the special case where rv + 1 = wv it is not necessary to validate the read-set, as it is guaranteed that no concurrently executing transaction could have modified it.
6. Commit and release the locks: For each location in the write-set, store to the location the new value from the write-set and release the location's lock by setting the version value to the write-version wv and clearing the write-lock bit (this is done using a simple store).

A few things to note. The write-locks are held only for a brief time when attempting to commit the transaction. This helps improve performance under high contention. The Bloom filter allows us to determine, by reading the single filter word, that a value is not in the write-set and need not be searched for. Though locks could have been acquired in ascending address order to avoid deadlock, we found that sorting the addresses in the write-set was not worth the effort.

Low-Cost Read-Only Transactions. One of the goals of the proposed methodology's design is an efficient execution of read-only transactions, as they dominate usage patterns in many applications. To execute a read-only transaction:

1. Sample the global version-clock: Load the current value of the global version-clock and store it in a local variable called read-version (rv).
2. Run through a speculative execution: Execute the transaction code. Each load instruction is post-validated by checking that the location's versioned write-lock is free and making sure that the lock's version field is ≤ rv. If it is greater than rv the transaction is aborted, otherwise it commits.

As can be seen, the read-only implementation is highly efficient because it does not construct or validate a read-set. Detection of read-only behavior can be done at the level of each specific transaction site (e.g., method or atomic block). This can be done at compile time or by simply running all methods first as read-only, and upon detecting the first transactional write, abort and set a flag to indicate that this method should henceforth be executed in write mode.

2.2 A Low Contention Global Version-Clock Implementation

There are various ways in which one could implement the global version-clock used in the algorithm. The key difficulty with the global clock implementation is that it may introduce increased contention and costly cache coherent sharing. One approach to reducing this overhead is based on splitting the global version-clock variable so it includes a version number and a thread id. Based on this split, a thread will not need to change the version number if it is different than the version number it used when it last wrote. In such a case all it will need to do is write its own version number in any given memory location. This can lead to an overall reduction by a factor of n in the number of version clock increments.

1. Each version number will include the thread id of the thread that last modified it.
2. Each thread, when performing the load/CAS to increment the global version-clock, checks after the load to see if the global version-clock differs from the thread's previous wv (note that if it fails on the CAS and retries the load/CAS then it knows the number was changed). If it differs, then the thread does not perform the CAS, and writes the version number it loaded and its id into all locations it modifies. If the global version number has not changed, the thread must CAS a new global version number greater by one and its id into the global version and use this in each location.
3. To read, a thread loads the global version-clock, and any location with a version number > rv, or = rv and having an id different than that of the transaction who last changed the global version, will cause a transaction failure.

This has the potential to cut the number of CAS operations on the global version-clock by a linear factor. It does however introduce the possibility of "false positive" failures. In the simple global version-clock which is always incremented, a read of some location that saw, say, value v+n, would not fail on things less than v+n, but with the new scheme, it could be that threads 1..n-1 all perform non-modifying increments by changing only the id part of a version-clock, leaving the value unchanged at v, and the reader also reads v for the version-clock (instead of v+n as he would have in the regular scheme). It can thus fail on account of each of the writes even though in the regular scheme it would have seen most of them with values v ... v+n-1.

2.3 Mixed Transactional and Non-Transactional Memory Management

The current implementation of TL2 views memory as being clearly divided into transactional and non-transactional (heap) space where mixed-mode transactional and non-transactional accesses are proscribed. As long as a memory location can be accessed by transactional load or store operations it must not be accessible to non-transactional load and store operations and vice versa. We do however wish to allow memory recycled from one space to be reusable in the other.

For type-safe garbage collected managed runtime environments such as that of the Java programming language, any of the TL2 lock-mapping policies (PS or PO) provide this property as the GC assures that memory will only be released once no references remain to an object. However, in languages such as C or C++ that provide the programmer with explicit memory management operations such as malloc and free, we must take care never to free objects while they are accessible. The pitfalls of finding a solution for such languages are explained in detail in [18].

There is a simple solution for the per-stripe (PS) variation of TL2 (and in the earlier TL [18] scheme) that works with any malloc/free or similar style pair of operations. In the transactional space, a thread executing a transaction can only reach an object by following a sequence of references that are included in the transaction's read-set. By validating the transaction before writing the locations we can make sure that the read-set is consistent, guaranteeing that the object is accessible and has not been reclaimed. Transacted memory locations are modified after the transaction is validated and before their associated locks are released. This leaves a short period in which the objects in the transaction's write-set must not be freed. To prevent objects from being freed in that period, threads let objects quiesce before freeing them. By quiescing we mean letting any activity on the transactional locations complete by making sure that all locks on an object's associated memory locations are released by their owners beforehand. Once an object is quiesced it can be freed. This scheme works because any transaction that may acquire the lock and reach the disconnected location will fail its read-set validation.

Unfortunately, we have not found an efficient scheme for using the PO mode of TL2 in C or C++ because locks reside inside the object header, and the act of acquiring a lock cannot be guaranteed to take place while the object is alive. As can be seen in the performance section, on the benchmarks/machine we tested there is a penalty, though not an unbearable one, for using PS mode instead of PO.

In STMs that use encounter-time lock acquisition and undo-logs [5, 11] it is significantly harder to protect objects from being modified after they are reclaimed, as memory locations are modified one at a time, replacing old values with the new values written by the transaction. Even with quiescing, to protect from illegal memory modifications, one would have to repeatedly validate the entire transaction before updating each location in the write-set. This repeated validation is inefficient in its simplest form and complex (if at all possible) if one attempts to use the compiler to eliminate unnecessary validations.

2.4 Mechanical Transformation of Sequential Code

As we discussed earlier, the algorithm we describe can be added to code in a mechanical fashion, that is, without understanding anything about how the code works or what the program itself does. In our benchmarks, we performed the transformation by hand. We do however believe that it may be feasible to automate this process and allow a compiler to perform the transformation given a few rather simple limitations on the code structure within a transaction.

We note that hand-crafted data structures can always have an advantage over TL2, as TL2 has no way of knowing that prior loads executed within a transaction might no longer have any bearing on results produced by the transaction.

2.5 Software-Hardware Interoperability

Though we have described TL2 as a software based scheme, it can be made interoperable with HTM systems. On a machine supporting dynamic hardware transactions, transactions need only verify for each location read or written that the associated versioned write-lock is free. There is no need for the hardware transaction to store an intermediate locked state into the lock word(s). For every write they also need to update the version number of the associated lock upon completion. This suffices to provide interoperability between hardware and software transactions. Any software read will detect concurrent modifications of locations by hardware writes because the version number of the associated lock will have changed. Any hardware transaction will fail if a concurrent software transaction is holding the lock to write. Software transactions attempting to write will also fail in acquiring a lock on a location since lock acquisition is done using an atomic hardware synchronization operation (such as CAS or a single location transaction) which will fail if the version number of the location was modified by the hardware transaction.

3 Empirical Performance Evaluation

We present here a set of microbenchmarks that have become standard in the community [25], comparing a sequential red-black tree made concurrent using various algorithms representing state-of-the-art non-blocking [6] and lock-based [5, 18] STMs. For lack of space we can only present the red-black tree data structure and only four performance graphs.

The sequential red-black tree made concurrent using our transactional locking algorithm was derived from the java.util.TreeMap implementation found in the Java programming language JDK 6.0. That implementation was written by Doug Lea and Josh Bloch. In turn, parts of the Java TreeMap were derived from Cormen et al. [26]. We would have preferred to use the exact Fraser-Harris red-black tree [6] but that code was written to their specific transactional interface and could not readily be converted to a simple form.

The sequential red-black tree implementation exposes a key-value pair interface of put, delete, and get operations. The put operation installs a key-value pair. If the key is not present in the data structure put will insert a new element describing the key-value pair. If the key is already present in the data structure put will simply update the value associated with the existing key. The get operation queries the value for a given key, returning an indication if the key was present in the data structure. Finally, delete removes a key from the data structure, returning an indication if the key was found to be present in the data structure. The benchmark harness calls put, get and delete to operate on the underlying data structure. The harness allows for the proportion of put, get and delete operations to be varied by way of command line arguments, as well as the number of threads, trial duration, initial number of key-value pairs to be installed in the data structure, and the key-range. The key range describes the maximum possible size (capacity) of the data structure.

For our experiments we used a 16-processor Sun Fire™ V890 which is a cache coherent multiprocessor with 1.35 GHz UltraSPARC-IV® processors running Solaris™ 10. As claimed in the introduction, modern operating systems handle locking well, even when the number of threads is larger than the number of CPUs. In our benchmarks, our STMs used the Solaris schedctl mechanism to allow threads to request short-term preemption deferral by storing to a thread-specific location which is read by the kernel-level scheduler. Preemption deferral is advisory: the kernel will try to avoid preempting a thread that has requested deferral. We note that unfortunately we could not introduce the use of schedctl-based preemption deferral into the hand-crafted lock-based hanke code, the lock-based stm_ennals code described below, or stm_fraser. This affected their relative performance beyond 16 threads but not in the range below 16 threads. Our benchmarked algorithms included:

Mutex. We used the Solaris POSIX threads library mutex as a coarse-grained locking mechanism.

stm_fraser. This is the state-of-the-art non-blocking STM of Harris and Fraser [6]. We use the name originally given to the program by its authors. It has a special record per object with a pointer to a transaction record. The transformation of sequential to transactional code is not mechanical: the programmer specifies when objects are transactionally opened and closed to improve performance.

stm_ennals. This is the lock-based encounter-time object-based STM algorithm of Ennals taken from [5] and provided in LibLTX [6]. Note that LibLTX includes the original Fraser and Harris lockfree-lib package. It uses a lock per object and the non-mechanical object-based interface of [6]. Though we did not have access to code for the Saha et al. algorithm [11], we believe the Ennals algorithm to be a good representative of this class of algorithms, with the possible benefit that the Ennals structures were written using the non-mechanical object-based interface of [6] and because unlike Saha et al., Ennals' write-set is not shared among threads.

TL/PO. A version of our algorithm [18] which does not use a global version clock; instead it collects read- and write-sets and validates the read-set after acquiring the locks on the memory locations. Unlike TL2, it thus requires a safe running environment. We bring here the per-object locking variation of the TL algorithm.

hanke. This is the hand-crafted lock-based concurrent relaxed red-black tree implementation of Hanke [27] as coded by Fraser [6]. The idea of relaxed balancing is to uncouple the re-balancing from the updating in order to speed up the update operations and to allow a high degree of concurrency.

TL2. Our new transactional locking algorithm. We use the notation TL2/PO and TL2/PS to denote the per-object and per-stripe variations. The PO variation consistently performed better than PS, but PS is compatible with open memory systems.

In Figure 1 we present four red-black tree benchmarks performed using two different key ranges and two set-operation distributions. The key range of [100; 200] generates a small size tree while the range [10000; 20000] creates a larger tree, imposing larger transaction size for the set operations. The different operation distributions represent two types of workloads, one dominated by reads (5% puts, 5% deletes, and 90% gets) and the other (30% puts, 30% deletes, and 40% gets) dominated by writes.
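The two workload mixes and key ranges described above can be illustrated with a small single-threaded sketch. This is not the authors' harness: the function names pick_operation and run_trial are hypothetical, a plain dict stands in for the red-black tree, keys are assumed to be drawn uniformly from an inclusive range, and thread count, trial duration and initial population are left out.

```python
import random

def pick_operation(rng, put_pct, delete_pct):
    """Choose 'put', 'delete' or 'get'; gets take whatever percentage is left."""
    roll = rng.random() * 100.0
    if roll < put_pct:
        return "put"
    if roll < put_pct + delete_pct:
        return "delete"
    return "get"

def run_trial(num_ops, put_pct, delete_pct, key_range, seed=0):
    """Apply num_ops randomly chosen operations and report how many of each ran."""
    rng = random.Random(seed)
    low, high = key_range
    table = {}                      # stand-in for the red-black tree (assumption)
    counts = {"put": 0, "delete": 0, "get": 0}
    for _ in range(num_ops):
        key = rng.randint(low, high)            # inclusive bounds (assumption)
        op = pick_operation(rng, put_pct, delete_pct)
        counts[op] += 1
        if op == "put":
            table[key] = key                    # insert or update
        elif op == "delete":
            table.pop(key, None)                # remove if present
        else:
            table.get(key)                      # lookup only
    return counts, table

# The two distributions from the text, on the small key range.
read_dominated = run_trial(10_000, 5, 5, (100, 200))
write_dominated = run_trial(10_000, 30, 30, (100, 200))
```

Because the key range bounds how many distinct keys can exist, the small range keeps the structure near a hundred entries at most, while the large range lets it grow two orders of magnitude further, which is what makes each operation touch more locations.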
[Figure 1: four panels, each plotting throughput (1000 x ops/sec) against number of threads on the 16x V890 for mutex, TL2 PO/CMT, TL PO/CMT, TL2 PS/CMT, TL PS/CMT, stm_ennals, stm_fraser and hanke. (a) Small Red-Black Tree 5%/5%/90%; (b) Small Red-Black Tree 30%/30%/40%; (c) Large Red-Black Tree 5%/5%/90%; (d) Large Red-Black Tree 30%/30%/40%.]

Fig. 1. Throughput of Red-Black Tree with 5% puts and 5% deletes and 30% puts, 30% deletes

In all four graphs, all algorithms scale quite well to 16 processors, with the exception of the mutual exclusion based one. Ennals's algorithm performs badly on the contended write-dominated benchmark, apparently suffering from frequent transaction collisions, which are more likely to occur in encounter-time locking based solutions. Beyond 16 threads, the Hanke and Ennals algorithms deteriorate because we could not introduce the schedctl mechanism to allow threads to request short-term preemption deferral. It is interesting to note that the Fraser-Harris STM continues to perform well beyond 16 threads even without this mechanism because it is non-blocking.

As expected, object based algorithms (PO) do better than stripe-based (PS) ones because of the improved locality in accessing the locks and the data. The performance of all the STM implementations usually differs by a constant factor, most of which we associate with the overheads of the algorithmic mechanisms employed (as seen in the single thread performance). The hand-crafted algorithm of Hanke provides the highest throughput in most cases because its single thread performance (a measure of overhead) is superior to all STM algorithms.

On the smaller data structures TL/PO (or TL/PS) performs better than TL2/PO (respectively TL2/PS) because of lower overheads, part of which can be associated with invalidation traffic caused by updates of the version clock (this is not traffic caused by CAS operations on the shared location; it is due to the fact that the location is being updated). This is reversed and TL2 becomes superior when the data structure is large because the read-set is larger and read-only transactions incur less overhead in TL2.

The TL and TL2 algorithms are in most cases superior to Ennals's STM and Fraser and Harris's STM. In all benchmarks they are an order of magnitude faster than the single lock Mutex implementation.

4 Conclusion

The TL2 algorithm presented in this paper provides a safe and easy to integrate STM implementation with reasonable performance, providing a programming environment similar to that attained using global locks, but with a ten-fold improvement in performance. TL2 will easily combine with hardware transactional mechanisms once these become available. It provides a strong indication that we should continue to devise lock-based STMs.

There is however still much work to be done to improve TL2's performance. A lot of these improvements may require hardware support, for example, in implementing the global version clock and in speeding up the collection of the read-set.

The full TL2 code will be publicly available shortly.

References

1. Herlihy, M., Moss, E.: Transactional memory: Architectural support for lock-free data structures. In: Proceedings of the Twentieth Annual International Symposium on Computer Architecture. (1993)
2. Rajwar, R., Herlihy, M., Lai, K.: Virtualizing transactional memory. In: ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, USA, IEEE Computer Society (2005) 494-505
3. Ananian, C.S., Asanovic, K., Kuszmaul, B.C., Leiserson, C.E., Lie, S.: Unbounded transactional memory. In: HPCA '05: Proceedings of the 11th Int
ernationalSympo-siumonHigh-PerformanceComputerArchitecture,Washington,DC,USA,IEEEComputerSociety(2005)316{3274.Hammond,L.,Wong,V.,Chen,M.,Carlstrom,B.D.,Davis,J.D.,Hertzberg,B.,Prabhu,M.K.,Wijaya,H.,Kozyrakis,C.,Olukotun,K.:Transactionalmemoryco-herenceandconsistency.In:ISCA'04:Proceedingsofthe31stannualinternationalsymposiumonComputerarchitecture,Washington,DC,USA,IEEEComputerSociety(2004)1025.Ennals,R.:Softwaretransactionalmemoryshouldnotbeobstruction-free.www.cambridge.intel-research.net/rennals/notlockfree.pdf(2005)6.Harris,T.,Fraser,K.:Concurrentprogrammingwithoutlocks.www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-cpwl-submission.pdf(2004)7.Herlihy,M.:TheSXMsoftwarepackage.http://www.cs.brown.edu/mph/SXM/README.doc(2005)8.Herlihy,M.,Luchangco,V.,Moir,M.,Scherer,W.:Softwaretransactionalmemoryfordynamicdatastructures.In:Proceedingsofthe22ndAnnualACMSymposiumonPrinciplesofDistributedComputing.(2003)9.Marathe,V.J.,SchererIII,W.N.,Scott,M.L.:Adaptivesoftwaretransactionalmemory.In:Proceedingsofthe19thInternationalSymposiumonDistributedComputing,Cracow,Poland(2005) 10.Moir,M.:HybridTM:Integratinghardwareandsoftwaretransactionalmemory.TechnicalReportArchivist2004-0661,SunMicrosystemsResearch(2004)11.Saha,B.,Adl-Tabatabai,A.R.,Hudson,R.L.,Minh,C.C.,Hertzberg,B.:Ahighperformancesoftwaretransactionalmemorysystemforamulti-coreruntime.In:ToappearinPPoPP2006.(2006)12.Shavit,N.,Touitou,D.:Softwaretransactionalmemory.DistributedComputing10(2)(1997)99{11613.Welc,A.,Jagannathan,S.,Hosking,A.L.:Transactionalmonitorsforconcurrentobjects.In:ProceedingsoftheEuropeanConferenceonObject-OrientedProgram-ming.Volume3086ofLectureNotesinComputerScience.,Springer-Verlag(2004)519{54214.Ananian,C.S.,Rinard,M.:Ecientsoftwaretransactionsforobject-orientedlan-guages.In:ProceedingsofSynchronizationandConcurrencyinObject-OrientedLanguages(SCOOL),ACM(2005)15.Marathe,V.J.,Scherer,W.N.,Scott,M.L.:Designtradeo 
sinmodernsoftwaretransactionalmemorysystems.In:LCR'04:Proceedingsofthe7thworkshoponWorkshoponlanguages,compilers,andrun-timesupportforscalablesystems,NewYork,NY,USA,ACMPress(2004)1{716.Rajwar,R.,Hill,M.:Transactionalmemoryonline.http://www.cs.wisc.edu/trans-memory(2006)17.Harris,T.,Fraser,K.:Languagesupportforlightweighttransactions.SIGPLANNot.38(11)(2003)388{40218.Dice,D.,Shavit,N.:Whatreallymakestransactionsfast?In:TRANSACT06ACMWorkshop.(2006)19.Afek,Y.,Attiya,H.,Dolev,D.,Gafni,E.,Merritt,M.,Shavit,N.:Atomicsnapshotsofsharedmemory.J.ACM40(4)(1993)873{89020.Thomasian,A.:Concurrencycontrol:methods,performance,andanalysis.ACMComput.Surv.30(1)(1998)70{11921.Riegel,T.,Felber,P.,Fetzer,C.:ALazySnapshotAlgorithmwithEagerVal-idation.In:20thInternationalSymposiumonDistributedComputing(DISC).(2006)22.Agesen,O.,Detlefs,D.,Garthwaite,A.,Knippel,R.,Ramakrishna,Y.S.,White,D.:Anecientmeta-lockforimplementingubiquitoussynchronization.ACMSIGPLANNotices34(10)(1999)207{22223.Dice,D.:Implementingfastjavamonitorswithrelaxed-locks.In:JavaVirtualMachineResearchandTechnologySymposium,USENIX(2001)79{9024.Bloom,B.H.:Space/timetrade-o sinhashcodingwithallowableerrors.Commun.ACM13(7)(1970)422{42625.Herlihy,M.,Luchangco,V.,Moir,M.,Scherer,III,W.N.:Softwaretransactionalmemoryfordynamic-sizeddatastructures.In:Proceedingsofthetwenty-secondannualsymposiumonPrinciplesofdistributedcomputing,ACMPress(2003)92{10126.Cormen,T.H.,Leiserson,Charles,E.,Rivest,R.L.:IntroductiontoAlgorithms.MITPress(1990)CORth01:11.Ex.27.Hanke,S.:Theperformanceofconcurrentred-blacktreealgorithms.LectureNotesinComputerScience1668(1999)286{300