Transactional Locking II

Dave Dice (1), Ori Shalev (2,1), and Nir Shavit (1)

(1) Sun Microsystems Laboratories, 1 Network Drive, Burlington MA 01803-0903, {dice,shanir}@sun.com
(2) Tel-Aviv University, Tel-Aviv 69978, Israel, orish@post.tau.ac.il

Abstract. The transactional memory programming paradigm is gaining momentum as the approach of choice for replacing locks in concurrent programming. This paper introduces the transactional locking II (TL2) algorithm, a software transactional memory (STM) algorithm based on a combination of commit-time locking and a novel global version-clock based validation technique. TL2 improves on state-of-the-art STMs in the following ways: (1) unlike all other STMs it fits seamlessly with any system's memory life-cycle, including those using malloc/free, (2) unlike all other lock-based STMs it efficiently avoids periods of unsafe execution, that is, using its novel version-clock validation, user code is guaranteed to operate only on consistent memory states, and (3) in a sequence of high performance benchmarks, while providing these new properties, it delivered overall performance comparable to (and in many cases better than) that of all former STM algorithms, both lock-based and non-blocking. Perhaps more importantly, on various benchmarks, TL2 delivers performance that is competitive with the best hand-crafted fine-grained concurrent structures. Specifically, it is ten-fold faster than a single lock. We believe these characteristics make TL2 a viable candidate for deployment of transactional memory today, long before hardware transactional support is available.

1 Introduction

A goal of current multiprocessor software design is to introduce parallelism into software applications by allowing operations that do not conflict in accessing memory to proceed concurrently. The key tool in designing concurrent data structures has been the use of locks. Coarse-grained locking is easy to program, but unfortunately provides very poor performance because of limited parallelism. Fine-grained lock-based concurrent data structures perform exceptionally well, but designing them has long been recognized as a difficult task better left to experts. If concurrent programming is to become ubiquitous, researchers agree that alternative approaches that simplify code design and verification must
be developed.

This paper is interested in "mechanical" methods for transforming sequential code or coarse-grained lock-based code into concurrent code. By mechanical we mean that the transformation, whether done by hand, by a preprocessor, or by a compiler, does not require any program specific information (such as the programmer's understanding of the data flow relationships). Moreover, we wish to focus on techniques that can be deployed to deliver reasonable performance across a wide range of systems today, yet combine easily with specialized hardware support as it becomes available.

1.1 Transactional Programming

The transactional memory programming paradigm of Herlihy and Moss [1] is gaining momentum as the approach of choice for replacing locks in concurrent programming. Combining sequences of concurrent operations into atomic transactions seems to promise a great reduction in the complexity of both programming and verification, by making parts of the code appear to be sequential without the need to program fine-grained locks. Transactions will hopefully remove from the programmer the burden of figuring out the interaction among concurrent operations that happen to conflict with each other. Non-conflicting transactions will run uninterrupted in parallel, and those that do conflict will be aborted and retried without the programmer having to worry about issues such as deadlock.

There are currently proposals for hardware implementations of transactional memory (HTM) [1-4], purely software based ones, i.e. software transactional memories (STM) [5-13], and hybrid schemes (HyTM) that combine hardware and software [14,10]. (A broad survey of prior art can be found in [6,15,16].)

The dominant trend among transactional memory designs seems to be that the transactions provided to the programmer, in either hardware or software, should be "large scale", that is, unbounded and dynamic. Unbounded means that there is no limit on the number of locations accessed by the transaction. Dynamic (as opposed to static) means that the set of locations accessed by the transaction is not known in advance and is determined during its execution.

Providing large scale transactions in hardware tends to introduce large degrees of complexity into the design [1-4]. Providing them efficiently in software is a difficult task, and there seem to be numerous design parameters and approaches in the literature [5-11], as well as requirements to combine well with hardware transactions once those become available [14,10].

1.2 Lock-Based Software Transactional Memory

STM design has come a long way since the first STM algorithm by Shavit and Touitou [12], which provided a non-blocking implementation of static transactions (see [5-8,15,9-13]). A recent paper by Ennals [5] suggested that on modern operating systems deadlock prevention is the only compelling reason for making transactions non-blocking, and that there is no reason to provide it for transactions at the user level. We second this claim, noting that mechanisms already exist whereby threads might yield their quanta to other threads and that Solaris' schedctl allows threads to transiently defer preemption while holding locks.

Ennals [5] proposed an all-software lock-based implementation of software transactional memory using the object-based approach of [17]. His idea was to run through the transaction possibly operating on an inconsistent memory state, acquiring write locks as locations to be written are encountered, writing the new values in place and having pointers to an undo set that is not shared with other threads. The use of locks eliminates the need for indirection and shared transaction records as in the non-blocking STMs; it still requires, however, a closed memory system. Deadlocks and livelocks are dealt with using timeouts and the ability of transactions to request other transactions to abort. Another recent paper by Saha et al. [11] uses a version of Ennals' lock-based algorithm within a run-time system. The scheme described by Saha et al. acquires locks as they are encountered, but also keeps shared undo sets to allow transactions to actively abort others.

A workshop presentation by two of the authors [18] shows that lock-based STMs tend to outperform non-blocking ones due to simpler algorithms that result in lower overheads. However, two limitations remain, limitations that must be overcome if STMs are to be commercially deployed:

Closed Memory Systems. Memory used transactionally must be recyclable to be used non-transactionally and vice versa. This is relatively easy in garbage collected languages, but must also be supported in languages like C with standard malloc() and free() operations. Unfortunately, all non-blocking STM designs require closed memory systems, and the lock-based STMs [5,11] either use closed systems or require specialized malloc() and free() operations.

Specialized Managed Runtime Environments. Current efficient STMs [5,11] require special environments capable of containing irregular effects in order to avoid unsafe behavior resulting from their operating on inconsistent states.

The TL2 algorithm presented in this paper is the first STM that overcomes both of these limitations: it works with an open memory system, essentially with any type of malloc() and free(), and it runs user code only on consistent states, eliminating the need for specialized managed runtime environments. (The TL algorithm [18], a precursor of TL2, works with an open memory system but runs on inconsistent states.)

1.3 Vulnerabilities of STMs

Let us explain the above vulnerabilities in more detail. Current efficient STM implementations [18,17,5,11] require closed memory systems as well as managed runtime environments capable of containing irregular effects. These closed systems and managed environments are necessary for efficient execution. Within these environments, they allow the execution of "zombies": transactions that have observed an inconsistent read-set but have yet to abort. The reliance on an accumulated read-set that is not a valid snapshot [19] of the shared memory locations accessed can cause unexpected behavior such as infinite loops, illegal memory accesses, and other run-time misbehavior.

The specialized runtime environment absorbs traps, converting them to transaction retries. Handling infinite loops in zombies is usually done by validating transactions while in progress. Validating the read-set on every transactional load would guarantee safety, but would also significantly impact performance. Another option is to perform periodic validations, for example, once every number of transactional loads or when looping in the user code [11]. Ennals [5] attempts to detect infinite loops by having every n-th transactional object "open" operation validate part of the accumulated read-set. Unfortunately, this policy admits infinite loops (as it is possible for a transaction to read less than n inconsistent memory locations and cause the thread to enter an infinite loop containing no subsequent transactional loads). In general, infinite loop detection mechanisms require extending the compiler or translator to insert validation checks into potential loops.

The second issue with existing STM implementations is their need for a closed memory allocation system. For type-safe garbage collected managed runtime environments such as that of the Java programming language, the collector assures that transactionally accessed memory will only be released once no references remain to the object. However, in C or C++, an object may be freed and depart the transactional space while concurrently executing threads continue to access it. The object's associated lock, if used properly, can offer a way around this problem, allowing memory to be recycled using standard malloc/free style operations. The recycled locations might still be read by a concurrent transaction, but will never be written by one.

1.4 Our New Results

This paper introduces the transactional locking II (TL2) algorithm. TL2 overcomes the drawbacks of all state-of-the-art lock-based algorithms, including our earlier TL algorithm [18]. The new idea in our new TL2 algorithm is to have, perhaps counter-intuitively, a global version-clock that is incremented once by each transaction that writes to memory, and is read by all transactions. We show how this shared clock can be constructed so that for all but the shortest transactions, the effects of contention are minimal. We note that the technique of time-stamping transactions is well known in the database community [20]. A global-clock based STM is also proposed by Riegel et al. [21]. Our global-clock based algorithm differs from the database work in that it is tailored to be highly efficient as required by small STM transactions as opposed to large database ones. It differs from the "snapshot isolation" algorithm of Riegel et al. as TL2 is lock-based and very simple, while Riegel et al. is non-blocking but costly as it uses time-stamps to choose between multiple concurrent copies of a transaction based on their associated execution intervals.

In TL2, all memory locations are augmented with a lock that contains a version number. Transactions start by reading the global version-clock and validating every location read against this clock. As we prove, this allows us to guarantee at a very low cost that only consistent memory views are ever read. Writing transactions need to collect a read-set but read-only ones do not. Once read- and write-sets are collected, transactions acquire locks on locations to be written, increment the global version-clock and attempt to commit by validating the read-set. Once committed, transactions update the memory locations with the new global version-clock value and release the associated locks.

We believe TL2 is revolutionary in that it overcomes most of the safety and performance issues that have plagued high-performance lock-based STM implementations:

- Unlike all former lock-based STMs it efficiently avoids vulnerabilities related to reading inconsistent memory states, not to mention the fact that former lock-based STMs must use compiler assist or manual programmer intervention to perform validity tests in user code to try and avoid as many of these zombie behaviors as possible. The need to overcome these safety vulnerabilities will be a major factor when going from experimental algorithms to actual production quality STMs. Moreover, as Saha et al. [11] explain, validation introduced to limit the effects of these safety issues can have a significant impact on overall STM performance.
- Unlike any former STM, TL2 allows transactional memory to be recycled into non-transactional memory and back using malloc and free style operations. This is done seamlessly and with no added complexity.
- As we show
in Section 3, rather encouragingly, concurrent red-black trees derived in a mechanical fashion from sequential code using the TL2 algorithm and providing the above software engineering benefits, tend to perform as well as prior algorithms, exhibiting performance comparable to that of hand-crafted fine-grained lock-based algorithms. Overall TL2 is an order of magnitude faster than sequential code made concurrent using a single lock.

In summary, TL2's superior performance together with the fact that it combines seamlessly with hardware transactions and with any system's memory life-cycle, make it an ideal candidate for multi-language deployment today, long before hardware transactional support becomes commonly available.

2 Transactional Locking II

The TL2 algorithm we describe here is a global version-clock based variant of the transactional locking algorithm of Dice and Shavit (TL) [18]. As we will explain, based on this global versioning approach, and in contrast with prior local versioning approaches, we are able to eliminate several key safety issues afflicting other lock-based STM systems and simplify the process of mechanical code transformation. In addition, the use of global versioning will hopefully improve the performance of read-only transactions.

Our TL2 algorithm is a two-phase locking scheme that employs commit-time lock acquisition mode like the TL algorithm, differing from encounter-time algorithms such as those by Ennals [5] and Saha et al. [11].
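
As an illustration only, the global versioning idea outlined in Section 1.4 can be modeled in a few lines of Python. The sketch is single-threaded and deterministic: the lock is a plain flag rather than a word acquired with an atomic operation, and every name in it (GlobalClock, Location, Transaction, Abort, read_version, write_version) is hypothetical and not taken from the TL2 code. It shows only the control flow: sample the clock, check each read against the sample, and at commit time lock, increment the clock, validate the reads and publish.

```python
from dataclasses import dataclass


class Abort(Exception):
    """Signals that the transaction must be retried (hypothetical name)."""


@dataclass(eq=False)  # identity-based hashing, so a Location can key a dict
class Location:
    value: int = 0
    version: int = 0      # version part of the location's lock
    locked: bool = False  # lock bit of the location's lock


class GlobalClock:
    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def increment_and_fetch(self):
        self.value += 1
        return self.value


class Transaction:
    def __init__(self, clock):
        self.clock = clock
        self.read_version = clock.read()  # sampled once, at start
        self.read_set = []
        self.write_set = {}               # Location -> pending value

    def load(self, loc):
        if loc in self.write_set:
            return self.write_set[loc]    # see our own pending write
        value = loc.value
        if loc.locked or loc.version > self.read_version:
            raise Abort()                 # changed after the clock was sampled
        self.read_set.append(loc)
        return value

    def store(self, loc, value):
        self.write_set[loc] = value       # buffered; shared state untouched

    def commit(self):
        acquired = []
        try:
            for loc in self.write_set:    # lock the locations to be written
                if loc.locked:
                    raise Abort()
                loc.locked = True
                acquired.append(loc)
            write_version = self.clock.increment_and_fetch()
            for loc in self.read_set:     # validate the read-set
                held_by_other = loc.locked and loc not in self.write_set
                if held_by_other or loc.version > self.read_version:
                    raise Abort()
        except Abort:
            for loc in acquired:
                loc.locked = False        # back out without publishing
            raise
        for loc, value in self.write_set.items():
            loc.value = value
            loc.version = write_version   # publish with the new clock value
            loc.locked = False


clock = GlobalClock()
x = Location(value=10)

reader = Transaction(clock)   # samples clock value 0
writer = Transaction(clock)
writer.store(x, 11)
writer.commit()               # x now carries version 1

try:
    reader.load(x)
    outcome = "read"
except Abort:
    outcome = "aborted"       # version 1 is newer than the reader's sample

retry = Transaction(clock)    # samples clock value 1
fresh = retry.load(x)
```

In this interleaving the reader started before the writer committed, so its load sees a version newer than its sample and aborts, while a transaction started afterwards reads the new value. A real implementation would replace the flag and the counter with atomic operations.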
For each implemented transactional system (i.e. per application or data structure) we have a shared global version-clock variable. We describe it below using an implementation in which the counter is incremented using an increment-and-fetch implemented with a compare-and-swap (CAS) operation. Alternative implementations exist, however, that offer improved performance. The global version-clock will be read and incremented by each writing transaction and will be read by every read-only transaction.

We associate a special versioned write-lock with every transacted memory location. In its simplest form, the versioned write-lock is a single word spinlock that uses a CAS operation to acquire the lock and a store to release it. Since one only needs a single bit to indicate that the lock is taken, we use the rest of the lock word to hold a version number. This number is advanced by every successful lock-release. Unlike the TL algorithm or Ennals [5] and Saha et al. [11], in TL2 the new value written into each versioned write-lock location will be derived from the global version-clock, a property which will provide us with several performance and correctness benefits.

To implement a given data structure we allocate a collection of versioned write-locks. We can use various schemes for associating locks with shared data: per object (PO), where a lock is assigned per shared object, or per stripe (PS), where we allocate a separate large array of locks and memory is striped (partitioned) using some hash function to map each transactable location to a stripe. Other mappings between transactional shared variables and locks are possible. The PO scheme requires either manual or compiler-assisted automatic insertion of lock fields whereas PS can be used with unmodified data structures. PO might be implemented, for instance, by leveraging the header words of objects in the Java programming language [22,23]. A single PS stripe-lock array may be shared and used for different TL2 data structures within a single address-space. For instance an application with two distinct TL2 red-black trees and three TL2 hash-tables could use a single PS array for all TL2 locks. As our default mapping we chose an array of 2^20 entries of 32-bit lock words with the mapping function masking the variable address with "0x3FFFFC" and then adding in the base address of the lock array to derive the lock address.

In the following we describe the PS version of the TL2 algorithm although most of the details carry through verbatim for PO as well. We maintain thread local read- and write-sets as linked lists. Each read-set entry contains the address of the lock that "covers" the variable being read, and unlike former algorithms, does not need to contain the observed version number of the lock. The write-set entries contain the address of the variable, the value to be written to the variable, and the address of its associated lock. In many cases the lock and location address are related and so we need to keep only one of them in the read-set. The write-set is kept in chronological order to avoid write-after-write hazards.

2.1 The Basic TL2 Algorithm

We now describe how TL2 executes a sequential code fragment that was placed within a TL2 transaction. As we explain, TL2 does not require traps or the insertion of validation tests within user code, and in this mode does not require type-stable garbage collection, working seamlessly with the memory life-cycle of languages like C and C++.

Write Transactions. The following sequence of operations is performed by a writing transaction, one that performs writes to the shared memory. We will assume that a transaction is a writing transaction. If it is a read-only transaction this can be denoted by the programmer, determined at compile time or heuristically at runtime.

1. Sample global version-clock: Load the current value of the global version clock and store it in a thread local variable called the read-version number (rv). This value is later used for detection of recent changes to data fields by comparing it to the version fields of their versioned write-locks.
2. Run through a speculative execution: Execute the transaction code (load and store instructions are mechanically augmented and replaced so that speculative execution does not change the shared memory's state, hence the term "speculative"). Locally maintain a read-set of addresses loaded and a write-set of address/value pairs stored. This logging functionality is implemented by augmenting loads with instructions that record the read address and replacing stores with code recording the address and value to-be-written. The transactional load first checks (using a Bloom filter [24]) to see if the load address already appears in the write-set. If so, the transactional load returns the
last value written to the address. This provides the illusion of processor consistency and avoids read-after-write hazards.
   A load instruction sampling the associated lock is inserted before each original load, which is then followed by post-validation code checking that the location's versioned write-lock is free and has not changed. Additionally, we make sure that the lock's version field is ≤ rv and the lock bit is clear. If it is greater than rv it suggests that the memory location has been modified after the current thread performed step 1, and the transaction is aborted.
3. Lock the write-set: Acquire the locks in any convenient order using bounded spinning to avoid indefinite deadlock. In case not all of these locks are successfully acquired, the transaction fails.
4. Increment global version-clock: Upon successful completion of lock acquisition of all locks in the write-set perform an increment-and-fetch (using a CAS operation for example) of the global version-clock, recording the returned value in a local write-version number variable wv.
5. Validate the read-set: Validate for each location in the read-set that the version number associated with the versioned write-lock is ≤ rv. We also verify that these memory locations have not been locked by other threads. In case the validation fails, the transaction is aborted. By re-validating the read-set, we guarantee that its memory locations have not been modified while steps 3 and 4 were being executed. In the special case where rv + 1 = wv it is not necessary to validate the read-set, as it is guaranteed that no concurrently executing transaction could have modified it.
6. Commit and release the locks: For each location in the write-set, store to the location the new value from the write-set and release the location's lock by setting the version value to the write-version wv and clearing the write-lock bit (this is done using a simple store).

A few things to note. The write-locks have been held for a brief time when attempting to commit the transaction. This helps improve performance under high contention. The Bloom filter allows us to determine if a value is not in the write-set and need not be searched for by reading the single filter word. Though locks could have been acquired in ascending address order to avoid deadlock, we found that sorting the addresses in the write-set was not worth the effort.

Low-Cost Read-Only Transactions. One of the goals of the proposed methodology's design is an efficient execution of read-only transactions, as they dominate usage patterns in many applications. To execute a read-only transaction:

1. Sample the global version-clock: Load the current value of the global version-clock and store it in a local variable called read-version (rv).
2. Run through a speculative execution: Execute the transaction code. Each load instruction is post-validated by checking that the location's versioned write-lock is free and making sure that the lock's version field is ≤ rv. If it is greater than rv the transaction is aborted, otherwise it commits.

As can be seen, the read-only implementation is highly efficient because it does not construct or validate a read-set. Detection of read-only behavior can be done at the level of each specific transaction site (e.g., method or atomic block). This can be done at compile time or by simply running all methods first as read-only, and upon detecting the first transactional write, abort and set a flag to indicate that this method should henceforth be executed in write mode.

2.2 A Low Contention Global Version-Clock Implementation

There are various ways in which one could implement the global version-clock used in the algorithm. The key difficulty with the global clock implementation is that it may introduce increased contention and costly cache coherent sharing. One approach to reducing this overhead is based on splitting the global version-clock variable so it includes a version number and a thread id. Based on this split, a thread will not need to change the version number if it is different than the version number it used when it last wrote. In such a case all it will need to do is write its own version number in any given memory location. This can lead to an overall reduction by a factor of n in the number of version clock increments.

1. Each version number will include the thread id of the thread that last modified it.
2. Each thread, when performing the load/CAS to increment the global version-clock, checks after the load to see if the global version-clock differs from the thread's previous wv (note that if it fails on the CAS and retries the load/CAS then it knows the number was changed). If it differs, then the thread does not perform the CAS, and writes the version number it loaded and its id into all locations it modifies. If the global version number has not changed, the thread must CAS a new global version number greater by one and its id into the global version and use this in each location.
3. To read, a thread loads the global version-clock, and any location with a version number greater than rv, or equal to rv and having an id different than that of the transaction who last changed the global version, will cause a transaction failure.

This has the potential to cut the number of CAS operations on the global version-clock by a linear factor. It does however introduce the possibility of "false positive" failures. In the simple global version-clock which is always incremented, a read of some location that saw, say, value v+n, would not fail on things less than v+n, but with the new scheme, it could be that threads 1..n-1 all perform non-modifying increments by changing only the id part of a version-clock, leaving the value unchanged at v, and the reader also reads v for the version-clock (instead of v+n as he would have in the regular scheme). It can thus fail on account of each of the writes even though in the regular scheme it would have seen most of them with values v ... v+n-1.

2.3 Mixed Transactional and Non-Transactional Memory Management

The current implementation of TL2 views memory as being clearly divided into transactional and non-transactional (heap) space where mixed-mode transactional and non-transactional accesses are proscribed. As long as a memory location can be accessed by transactional load or store operations it must not be accessible to non-transactional load and store operations and vice versa. We do however wish to allow memory recycled from one space to be reusable in the other. For type-safe garbage collected managed runtime environments such as that of the Java programming language, any of the TL2 lock-mapping policies (PS or PO) provide this property as the GC assures that memory will only be released once no references remain to an object. However, in languages such as C or C++ that provide the programmer with explicit memory management operations such as malloc and free, we must take care never to free objects while they are accessible. The pitfalls of finding a solution for such languages are explained in detail in [18].

There is a simple solution for the per-stripe (PS) variation of TL2 (and in the earlier TL [18] scheme) that works with any malloc/free or similar style pair of operations. In the transactional space, a thread executing a transaction can only reach an object by following a sequence of references that are included in the transaction's read-set. By validating the transaction before writing the
locations we can make sure that the read set is consistent, guaranteeing that the object is accessible and has not been reclaimed. Transacted memory locations are modified after the transaction is validated and before their associated locks are released. This leaves a short period in which the objects in the transaction's write-set must not be freed. To prevent objects from being freed in that period, threads let objects quiesce before freeing them. By quiescing we mean letting any activity on the transactional locations complete by making sure that all locks on an object's associated memory locations are released by their owners before the object is freed. Once an object is quiesced it can be freed. This scheme works because any transaction that may acquire the lock and reach the disconnected location will fail its read-set validation.

Unfortunately, we have not found an efficient scheme for using the PO mode of TL2 in C or C++ because locks reside inside the object header, and the act of acquiring a lock cannot be guaranteed to take place while the object is alive. As can be seen in the performance section, on the benchmarks/machine we tested there is a penalty, though not an unbearable one, for using PS mode instead of PO.

In STMs that use encounter-time lock acquisition and undo-logs [5,11] it is significantly harder to protect objects from being modified after they are reclaimed, as memory locations are modified one at a time, replacing old values with the new values written by the transaction. Even with quiescing, to protect from illegal memory modifications, one would have to repeatedly validate the entire transaction before updating each location in the write-set. This repeated validation is inefficient in its simplest form and complex (if at all possible) if one attempts to use the compiler to eliminate unnecessary validations.

2.4 Mechanical Transformation of Sequential Code

As we discussed earlier, the algorithm we describe can be added to code in a mechanical fashion, that is, without understanding anything about how the code works or what the program itself does. In our benchmarks, we performed the transformation by hand. We do however believe that it may be feasible to automate this process and allow a compiler to perform the transformation given a few rather simple limitations on the code structure within a transaction.

We note that hand-crafted data structures can always have an advantage over TL2, as TL2 has no way of knowing that prior loads executed within a transaction might no longer have any bearing on results produced by the transaction.

2.5 Software-Hardware Interoperability

Though we have described TL2 as a software based scheme, it can be made interoperable with HTM systems. On a machine supporting dynamic hardware transactions, transactions need only verify for each location read or written that the associated versioned write-lock is free. There is no need for the hardware transaction to store an intermediate locked state into the lock word(s). For every write they also need to update the version number of the associated lock upon completion. This suffices to provide interoperability between hardware and software transactions. Any software read will detect concurrent modifications of locations by hardware writes because the version number of the associated lock will have changed. Any hardware transaction will fail if a concurrent software transaction is holding the lock to write. Software transactions attempting to write will also fail in acquiring a lock on a location since lock acquisition is done using an atomic hardware synchronization operation (such as CAS or a single location transaction) which will fail if the version number of the location was modified by the hardware transaction.

3 Empirical Performance Evaluation

We present here a set of microbenchmarks that have become standard in the community [25], comparing a sequential red-black tree made concurrent using various algorithms representing state-of-the-art non-blocking [6] and lock-based [5,18] STMs. For lack of space we can only present the red-black tree data structure and only four performance graphs.

The sequential red-black tree made concurrent using our transactional locking algorithm was derived from the java.util.TreeMap implementation found in the Java programming language JDK 6.0. That implementation was written by Doug Lea and Josh Bloch. In turn, parts of the Java TreeMap were derived from Cormen et al. [26]. We would have preferred to use the exact Fraser-Harris red-black tree [6] but that code was written to their specific transactional interface and could not readily be converted to a simple form.

The sequential red-black tree implementation exposes a key-value pair interface of put, delete, and get operations. The put operation installs a key-value pair. If the key is not present in the data structure put will insert a new element describing the key-value pair. If the key is already present in the data structure put will simply update the value associated with the existing key. The get operation queries the value for a given key, returning an indication if the key was present in the data structure. Finally, delete removes a key from the data structure, returning an indication if the key was found to be present in the data structure. The benchmark harness calls put, get and delete to operate on the underlying data structure. The harness allows for the proportion of put, get and delete operations to be varied by way of command line arguments, as well as the number of threads, trial duration, initial number of key-value pairs to be installed in the data structure, and the key-range. The key range describes the maximum possible size (capacity) of the data structure.

For our experiments we used a 16-processor Sun Fire™ V890 which is a cache coherent multiprocessor with 1.35 GHz UltraSPARC-IV® processors running Solaris™ 10. As claimed in the introduction, modern operating systems handle locking well, even when the number of threads is larger than the number of CPUs. In our benchmarks, our STMs used the Solaris schedctl mechanism to allow threads to request short-term preemption deferral by storing to a thread-specific location which is read by the kernel-level scheduler. Preemption deferral is advisory: the kernel will try to avoid preempting a thread that has requested deferral. We note that unfortunately we could not introduce the use of schedctl-based preemption deferral into the hand-crafted lock-based hanke code, the lock-based stm_ennals code described below, or stm_fraser. This affected their relative performance beyond 16 threads but not in the range below 16 threads.

Our benchmarked algorithms included:

Mutex. We used the Solaris POSIX threads library mutex as a coarse-grained locking mechanism.

stm_fraser. This is the state-of-the-art non-blocking STM of Harris and Fraser [6]. We use the name originally given to the program by its authors. It has a special record per object with a pointer to a transaction record. The transformation of sequential to transactional code is not mechanical: the programmer specifies when objects are transactionally opened and closed to improve performance.

stm_ennals. This is the lock-based encounter-time object-based STM algorithm of Ennals taken from [5] and provided in LibLTX [6]. Note that LibLTX includes the original Fraser and Harris lockfree-lib package. It uses a lock per object and a non-mechanical object-based interface of [6]. Though we did not have access to code for the Saha et al. algorithm [11], we believe the Ennals algorithm to be a good representative of this class of algorithms, with the possible benefit that the Ennals structures were written using the non-mechanical object-based interface of [6] and because unlike Saha et al., Ennals' write-set is not shared among threads.

TL/PO. A version of our algorithm [18] which does not use a global version clock; instead it collects read- and write-sets and validates the read-set after acquiring the locks on the memory locations. Unlike TL2, it thus requires a safe running environment. We bring here the per-object locking variation of the TL algorithm.

hanke. This is the hand-crafted lock-based concurrent relaxed red-black tree implementation of Hanke [27] as coded by Fraser [6]. The idea of relaxed balancing is to uncouple the re-balancing from the updating in order to speed up the update operations and to allow a high degree of concurrency.

TL2. Our new transactional locking algorithm. We use the notation TL2/PO and TL2/PS to denote the per-object and per-stripe variations. The PO variation consistently performed better than PS, but PS is compatible with open memory systems.

In Figure 1 we present four red-black tree benchmarks performed using two different key ranges and two set operation distributions. The key range of [100, 200] generates a small size tree while the range [10000, 20000] creates a larger tree, imposing larger transaction size for the set operations. The different operation distributions represent two types of workloads, one dominated by reads (5% puts, 5% deletes, and 90% gets) and the other (30% puts, 30% deletes, and 40% gets) dominated by
writes.

[Figure 1: four plots of throughput (1000 x ops/sec) against number of threads on the 16-processor V890, each with curves for mutex, TL2 PO/CMT, TL PO/CMT, TL2 PS/CMT, TL PS/CMT, stm_ennals, stm_fraser and hanke. Panels: (a) Small Red-Black Tree 5%/5%/90%; (b) Small Red-Black Tree 30%/30%/40%; (c) Large Red-Black Tree 5%/5%/90%; (d) Large Red-Black Tree 30%/30%/40%.]

Fig. 1. Throughput of Red-Black Tree with 5% puts and 5% deletes and 30% puts, 30% deletes

In all four graphs, all algorithms scale quite well to 16 processors, with the exception of the mutual exclusion based one. Ennals's algorithm performs badly on the contended write-dominated benchmark, apparently suffering from frequent transaction collisions, which are more likely to occur in encounter-time locking based solutions. Beyond 16 threads, the Hanke and Ennals algorithms deteriorate because we could not introduce the schedctl mechanism to allow threads to request short-term preemption deferral. It is interesting to note that the Fraser-Harris STM continues to perform well beyond 16 threads even without this mechanism because it is non-blocking.

As expected, object based algorithms (PO) do better than stripe-based (PS) ones because of the improved locality in accessing the locks and the data. The performance of all the STM implementations usually differs by a constant factor, most of which we associate with the overheads of the algorithmic mechanisms employed (as seen in the single thread performance). The hand-crafted algorithm of Hanke provides the highest throughput in most cases because its single thread performance (a measure of overhead) is superior to all STM algorithms. On the smaller data structures TL/PO (or TL/PS) performs better than TL2/PO (respectively TL2/PS) because of lower overheads, part of which can be associated with invalidation traffic caused by updates of the version clock (this is not traffic caused by CAS operations on the shared location; it is due to the fact that the location is being updated). This is reversed and TL2 becomes superior when the data structure is large because the read-set is larger and read-only transactions incur less overhead in TL2. The TL and TL2 algorithms are in most cases superior to Ennals's STM and Fraser and Harris's STM. In all benchmarks they are an order of magnitude faster than the single lock Mutex implementation.

4 Conclusion

The TL2 algorithm presented in this paper provides a safe and easy to integrate STM implementation with reasonable performance, providing a programming environment similar to that attained using global locks, but with a ten-fold improvement in performance. TL2 will easily combine with hardware transactional mechanisms once these become available. It provides a strong indication that we should continue to devise lock-based STMs. There is however still much work to be done to improve TL2's performance. A lot of these improvements may require hardware support, for example, in implementing the global version clock and in speeding up the collection of the read-set. The full TL2 code will be publicly available shortly.

References

1. Herlihy, M., Moss, E.: Transactional memory: Architectural support for lock-free data structures. In: Proceedings of the Twentieth Annual International Symposium on Computer Architecture. (1993)
2. Rajwar, R., Herlihy, M., Lai, K.: Virtualizing transactional memory. In: ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, USA, IEEE Computer Society (2005) 494-505
3. Ananian, C.S., Asanovic, K., Kuszmaul, B.C., Leiserson, C.E., Lie, S.: Unbounded transactional memory. In: HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, Washington, DC, USA, IEEE Computer Society (2005) 316-327
4. Hammond, L., Wong, V., Chen, M., Carlstrom, B.D., Davis, J.D., Hertzberg, B., Prabhu, M.K., Wijaya, H., Kozyrakis, C., Olukotun, K.: Transactional memory coherence and consistency. In: ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture, Washingto
n,DC,USA,IEEEComputerSociety(2004)1025.Ennals,R.:Softwaretransactionalmemoryshouldnotbeobstruction-free.www.cambridge.intel-research.net/rennals/notlockfree.pdf(2005)6.Harris,T.,Fraser,K.:Concurrentprogrammingwithoutlocks.www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-cpwl-submission.pdf(2004)7.Herlihy,M.:TheSXMsoftwarepackage.http://www.cs.brown.edu/mph/SXM/README.doc(2005)8.Herlihy,M.,Luchangco,V.,Moir,M.,Scherer,W.:Softwaretransactionalmemoryfordynamicdatastructures.In:Proceedingsofthe22ndAnnualACMSymposiumonPrinciplesofDistributedComputing.(2003)9.Marathe,V.J.,SchererIII,W.N.,Scott,M.L.:Adaptivesoftwaretransactionalmemory.In:Proceedingsofthe19thInternationalSymposiumonDistributedComputing,Cracow,Poland(2005) 10.Moir,M.:HybridTM:Integratinghardwareandsoftwaretransactionalmemory.TechnicalReportArchivist2004-0661,SunMicrosystemsResearch(2004)11.Saha,B.,Adl-Tabatabai,A.R.,Hudson,R.L.,Minh,C.C.,Hertzberg,B.:Ahighperformancesoftwaretransactionalmemorysystemforamulti-coreruntime.In:ToappearinPPoPP2006.(2006)12.Shavit,N.,Touitou,D.:Softwaretransactionalmemory.DistributedComputing10(2)(1997)99{11613.Welc,A.,Jagannathan,S.,Hosking,A.L.:Transactionalmonitorsforconcurrentobjects.In:ProceedingsoftheEuropeanConferenceonObject-OrientedProgram-ming.Volume3086ofLectureNotesinComputerScience.,Springer-Verlag(2004)519{54214.Ananian,C.S.,Rinard,M.:Ecientsoftwaretransactionsforobject-orientedlan-guages.In:ProceedingsofSynchronizationandConcurrencyinObject-OrientedLanguages(SCOOL),ACM(2005)15.Marathe,V.J.,Scherer,W.N.,Scott,M.L.:Designtradeosinmodernsoftwaretransactionalmemorysystems.In:LCR'04:Proceedingsofthe7thworkshoponWorkshoponlanguages,compilers,andrun-timesupportforscalablesystems,NewYork,NY,USA,ACMPress(2004)1{716.Rajwar,R.,Hill,M.:Transactionalmemoryonline.http://www.cs.wisc.edu/trans-memory(2006)17.Harris,T.,Fraser,K.:Languagesupportforlightweighttransactions.SIGPLANNot.38(11)(2003)388{40218.Dice,D.,Shavit,N.:Whatreallymakestransactionsfast?In:TRANSACT06ACMWorkshop.(2006)1
9.Afek,Y.,Attiya,H.,Dolev,D.,Gafni,E.,Merritt,M.,Shavit,N.:Atomicsnapshotsofsharedmemory.J.ACM40(4)(1993)873{89020.Thomasian,A.:Concurrencycontrol:methods,performance,andanalysis.ACMComput.Surv.30(1)(1998)70{11921.Riegel,T.,Felber,P.,Fetzer,C.:ALazySnapshotAlgorithmwithEagerVal-idation.In:20thInternationalSymposiumonDistributedComputing(DISC).(2006)22.Agesen,O.,Detlefs,D.,Garthwaite,A.,Knippel,R.,Ramakrishna,Y.S.,White,D.:Anecientmeta-lockforimplementingubiquitoussynchronization.ACMSIGPLANNotices34(10)(1999)207{22223.Dice,D.:Implementingfastjavamonitorswithrelaxed-locks.In:JavaVirtualMachineResearchandTechnologySymposium,USENIX(2001)79{9024.Bloom,B.H.:Space/timetrade-osinhashcodingwithallowableerrors.Commun.ACM13(7)(1970)422{42625.Herlihy,M.,Luchangco,V.,Moir,M.,Scherer,III,W.N.:Softwaretransactionalmemoryfordynamic-sizeddatastructures.In:Proceedingsofthetwenty-secondannualsymposiumonPrinciplesofdistributedcomputing,ACMPress(2003)92{10126.Cormen,T.H.,Leiserson,Charles,E.,Rivest,R.L.:IntroductiontoAlgorithms.MITPress(1990)CORth01:11.Ex.27.Hanke,S.:Theperformanceofconcurrentred-blacktreealgorithms.LectureNotesinComputerScience1668(1999)286{300