Understanding POWER Multiprocessors

Susmit Sarkar (1), Peter Sewell (1), Jade Alglave (2,3), Luc Maranget (3), Derek Williams (4)
(1) University of Cambridge, (2) Oxford University, (3) INRIA, (4) IBM Austin

Abstract

Exploiting today's multiprocessors requires high-performance and correct concurrent systems code (optimising compilers, language runtimes, OS kernels, etc.), which in turn requires a good understanding of the observable processor behaviour that can be relied on. Unfortunately this critical hardware/software interface is not at all clear for several current multiprocessors.

In this paper we characterise the behaviour of IBM POWER multiprocessors, which have a subtle and highly relaxed memory model (ARM multiprocessors have a very similar architecture in this respect). We have conducted extensive experiments on several generations of processors: POWER G5, 5, 6, and 7. Based on these, on published details of the microarchitectures, and on discussions with IBM staff, we give an abstract-machine semantics that abstracts from most of the implementation detail but explains the behaviour of a range of subtle examples. Our semantics is explained in prose but defined in rigorous machine-processed mathematics; we also confirm that it captures the observable processor behaviour, or the architectural intent, for our examples with an executable checker. While not officially sanctioned by the vendor, we believe that this model gives a reasonable basis for reasoning about current POWER multiprocessors.

Our work should bring new clarity to concurrent systems programming for these architectures, and is a necessary precondition for any analysis or verification. It should also inform the design of languages such as C and C++, where the language memory model is constrained by what can be efficiently compiled to such multiprocessors.

Categories and Subject Descriptors  C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: Parallel processors; D.1.3 [Concurrent Programming]: Parallel programming; F.3.1 [Specifying and Verifying and Reasoning about Programs]

General Terms  Documentation, Languages, Reliability, Standardization, Theory, Verification

Keywords  Relaxed Memory Models, Semantics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'11, June 4-8, 2011, San Jose, California, USA.
Copyright © 2011 ACM 978-1-4503-0663-8/11/06...$10.00

1. Introduction

Power multiprocessors (including the IBM POWER 5, 6, and 7, and various PowerPC implementations) have for many years had aggressive implementations, providing high performance but exposing a very relaxed memory model, one that requires careful use of dependencies and memory barriers to enforce ordering in concurrent code. A priori, one might expect the behaviour of a multiprocessor to be sufficiently well-defined by the vendor architecture documentation, here the Power ISA v2.06 specification [Pow09]. For the sequential behaviour of instructions, that is very often true. For concurrent code, however, the observable behaviour of Power multiprocessors is extremely subtle, as we shall see, and the guarantees given by the vendor specification are not always clear. We therefore set out to discover the actual processor behaviour and to define a rigorous and usable semantics, as a foundation for future system building and research.

The programmer-observable relaxed-memory behaviour of these multiprocessors emerges as a whole-system property from a complex microarchitecture [SKT+05, LSF+07, KSSF10]. This can change significantly between generations, e.g. from POWER 6 to POWER 7, but includes:
- cores that perform out-of-order and speculative execution, with many shadow registers;
- hierarchical store buffering, with some buffering shared between threads of a simultaneous multi-threading (SMT) core, and with multiple levels of cache;
- store buffering partitioned by address; and
- a cache protocol with many cache-line states and a complex interconnection topology, and in which cache-line invalidate messages are buffered.
The implementation of coherent memory and of the memory barriers involves many of these, working together. To make a useful model, it is essential to abstract from as much as possible of this complexity, both to make it simple enough to be comprehensible and because the detailed hardware designs are proprietary (the published literature does not describe the microarchitecture in enough detail to confidently predict all the observable behaviour). Of course, the model also has to be sound, allowing all behaviour that the hardware actually exhibits, and sufficiently strong, capturing any guarantees provided by the hardware that systems programmers rely on. It does not have to be tight: it may be desirable to make a loose specification, permitting some behaviour that current hardware does not exhibit, but which programmers do not rely on the absence of, for simplicity or to admit different implementations in future. The model does not have to correspond in detail to the internal structure of the hardware: we are capturing the external behaviour of reasonable implementations, not the implementations themselves. But it should have a clear abstraction relationship to implementation microarchitecture, so that the model truly explains the behaviour of examples.

To develop our model, and to establish confidence that it is sound, we have conducted extensive experiments, running several thousand tests, both hand-written and automatically generated, on several generations of processors, for up to 10^11 iterations each. We present some simple tests in §2, to introduce the relaxed behaviour allowed by Power processors, and some more subtle examples in §6, with representative experimental data in §7. To ensure that our model explains the behaviour of tests in a way that faithfully abstracts from the actual hardware, using appropriate concepts, we depend on extensive discussions with IBM staff. To validate the model against experiment, we built a checker, based on code automatically generated from the mathematical definition, to calculate the allowed outcomes of tests (§8); this confirms that the model gives the correct results for all tests we describe and for a systematically generated family of around 300 others.

Relaxed memory models are typically expressed either in an axiomatic or an operational style. Here we adopt the latter, defining an abstract machine in §3 and §4. We expect that this will be more intuitive than typical axiomatic models, as it has a straightforward notion of global time (in traces of abstract machine transitions), and the abstraction from the actual hardware is more direct. More particularly, to explain some of the examples, it seems to be necessary to model out-of-order and speculative reads explicitly, which is easier to do in an abstract-machine setting. This work is an exercise in making a model that is as simple as possible but no simpler: the model is considerably more complex than some (e.g. for TSO processors such as Sparc and x86), but does capture the processor behaviour or architectural intent for a range of subtle examples. Moreover, while the definition is mathematically rigorous, it can be explained in only a few pages of prose, so it should be accessible to the expert systems programmers (of concurrency libraries, language runtimes, optimising compilers, etc.) who have to be concerned with these issues. We end with discussion of related work (§9) and a brief summary of future directions (§10), returning at last to the vendor architecture.

2. Simple Examples

We begin with an informal introduction to Power multiprocessor behaviour by example, introducing some key concepts but leaving explanation in terms of the model to later.

2.1 Relaxed behaviour

In the absence of memory barriers or dependencies, Power multiprocessors exhibit a very relaxed memory model, as shown by their behaviour for the following four classic tests.

SB: Store Buffering. Here two threads write to shared-memory locations and then each reads from the other location, an idiom at the heart of Dekker's mutual-exclusion algorithm, for example. In pseudocode:

  Thread 0    Thread 1
  x=1         y=1
  r1=y        r2=x
  Initial shared state: x=0 and y=0
  Allowed final state: r1=0 and r2=0

In the specified execution both threads read the value from the initial state (in later examples, this is zero unless otherwise stated). To eliminate any ambiguity about exactly what machine instructions are executed, either from source-language semantics or compilation concerns, we take the definitive version of our examples to be in PowerPC assembly (available online [SSA+11]), rather than pseudocode. The assembly code is not easy to read, however, so here we present examples as diagrams of the memory read and write events involved in the execution specified by the initial and final state constraints. In this example, the pseudocode r1 and r2 represent machine registers, so accesses to those are not memory events; with the final state as specified, the only conceivable execution has two writes, labelled a and c, and two reads, labelled b and d, with values as below. They are related by program order po (later we elide implied po edges), and the fact that the two reads both read from the initial state (0) is indicated by the incoming reads-from (rf) edges (from writes to reads that read from them); in the listings here, init stands for the initial-state writes, which the original diagrams draw as dots.

Test SB: Allowed
  Thread 0: a: W[x]=1; b: R[y]=0
  Thread 1: c: W[y]=1; d: R[x]=0
  Edges: a -po-> b; c -po-> d; init -rf-> b; init -rf-> d

This example illustrates the key relaxation allowed in Sparc or x86 TSO models [Spa92, SSO+10]. The next three show some ways in which Power gives a weaker model.

MP: Message passing. Here Thread 0 writes data x and then sets a flag y, while Thread 1 reads y from that flag write and then reads x. On Power that read is not guaranteed to see the Thread 0 write of x; it might instead read from 'before' that write, despite the chain of po and rf edges:

Test MP: Allowed
  Thread 0: a: W[x]=1; b: W[y]=1
  Thread 1: c: R[y]=1; d: R[x]=0
  Edges: a -po-> b; b -rf-> c; c -po-> d; init -rf-> d

In real code, the read c of y might be in a loop, repeated until the value read is 1. Here, to simplify experimental testing, we do not have a loop but instead consider only executions in which the value read is 1, expressed with a constraint on the final register values in the test source.

WRC: Write-to-Read Causality. Here Thread 0 communicates to Thread 1 by writing x=1. Thread 1 reads that, and then later (in program order) sends a message to Thread 2 by writing into y. Having read that write of y at Thread 2, the question is whether a program-order-subsequent read of x at Thread 2 is guaranteed to see the value written by the Thread 0 write, or might read from 'before' that, as shown, again despite the rf and po chain. On Power that is possible.

Test WRC: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1; c: W[y]=1
  Thread 2: d: R[y]=1; e: R[x]=0
  Edges: a -rf-> b; b -po-> c; c -rf-> d; d -po-> e; init -rf-> e

IRIW: Independent Reads of Independent Writes. Here two threads (0 and 2) write to distinct locations while two others (1 and 3) each read from both locations. In the specified allowed execution, they see the two writes in different orders (Thread 1's first read sees the write to x but the program-order-subsequent read does not see the write of y, whereas Thread 3 sees the write to y but not that to x).

Test IRIW: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1; c: R[y]=0
  Thread 2: d: W[y]=1
  Thread 3: e: R[y]=1; f: R[x]=0
  Edges: a -rf-> b; b -po-> c; d -rf-> e; e -po-> f; init -rf-> c; init -rf-> f

Coherence. Despite all the above, one does get a guarantee of coherence: in any execution, for each location, there is a single linear order (co) of all writes (by any processor) to that location, which must be respected by all threads. The four cases below illustrate this: a pair of reads by a thread cannot read contrary to the coherence order (CoRR1); the coherence order must respect program order for a pair of writes by a thread (CoWW); a read cannot read from a write that is coherence-hidden by another write program-order-preceding the read (CoWR); and a write cannot coherence-order-precede a write that a program-order-preceding read read from (CoRW). We can now clarify the 'before' in the MP and WRC discussion above, which was with respect to the coherence order for x.
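The coherence guarantee can also be checked mechanically. The sketch below is not the paper's abstract machine: it assumes the common axiomatic formulation in which an execution on a single location is coherent when the union of program order, reads-from, coherence and a derived from-reads relation is acyclic. The function names acyclic and coherent, the event labels, and the label "init" for the initial-state write are all illustrative choices.

```python
def acyclic(nodes, edges):
    """Return True if the directed graph (nodes, edges) has no cycle."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on the DFS stack, 2 = finished

    def visit(n):
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and not visit(m)):
                return False  # reached a node that is still on the stack
        state[n] = 2
        return True

    return all(state[n] != 0 or visit(n) for n in nodes)


def coherent(po, rf, co):
    """Check one location. po, rf and co are sets of (from, to) event pairs;
    co is assumed to be given transitively closed."""
    # from-reads: a read precedes every write that is coherence-after
    # the write it read from
    fr = {(r, w2) for (w1, r) in rf for (w1b, w2) in co if w1 == w1b}
    edges = set(po) | set(rf) | set(co) | fr
    nodes = {e for edge in edges for e in edge}
    return acyclic(nodes, edges)


# The four forbidden shapes, using the event labels of the coherence tests.
corr1 = coherent(po={("b", "c")}, rf={("a", "b"), ("init", "c")}, co={("init", "a")})
coww = coherent(po={("a", "b")}, rf=set(), co={("b", "a")})
cowr = coherent(po={("a", "b")}, rf={("c", "b")}, co={("c", "a")})
corw = coherent(po={("a", "b")}, rf={("c", "a")}, co={("b", "c")})

# A permitted variant of CoRR1: the first read sees the initial state,
# the second sees the new write.
ok = coherent(po={("b", "c")}, rf={("init", "b"), ("a", "c")}, co={("init", "a")})
```

Each of the four forbidden shapes produces a cycle, so the check rejects it, while the reordered variant is accepted.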
Test CoRR1: Forbidden
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1; c: R[x]=0
  Edges: a -rf-> b; b -po-> c; init -rf-> c

Test CoWW: Forbidden
  Thread 0: a: W[x]=1; b: W[x]=2
  Edges: a -po-> b; b -co-> a

Test CoWR: Forbidden
  Thread 0: a: W[x]=1; b: R[x]=2
  Thread 1: c: W[x]=2
  Edges: a -po-> b; c -co-> a; c -rf-> b

Test CoRW: Forbidden
  Thread 0: a: R[x]=2; b: W[x]=1
  Thread 1: c: W[x]=2
  Edges: a -po-> b; b -co-> c; c -rf-> a

2.2 Enforcing ordering

The Power ISA provides several ways to enforce stronger ordering. Here we deal with the sync (heavyweight sync, or hwsync) and lwsync (lightweight sync) barrier instructions, and with dependencies and the isync instruction, leaving load-reserve/store-conditional pairs and eieio to future work.

Regaining sequential consistency (SC) using sync. If one adds a sync between every program-order pair of instructions (creating tests SB+syncs, MP+syncs, WRC+syncs, and IRIW+syncs), then all the non-SC results above are forbidden, e.g.

Test MP+syncs: Forbidden
  Thread 0: a: W[x]=1; b: W[y]=1
  Thread 1: c: R[y]=1; d: R[x]=0
  Edges: a -sync-> b; b -rf-> c; c -sync-> d; init -rf-> d

Using dependencies. Barriers can incur a significant runtime cost, and in some cases enough ordering is guaranteed simply by the existence of a dependency from a memory read to another memory access. There are various kinds:
- There is an address dependency (addr) from a read to a program-order-later memory read or write if there is a dataflow path from the read, through registers and arithmetic/logical operations (but not through other memory accesses), to the address of the second read or write.
- There is a data dependency (data) from a read to a memory write if there is such a path to the value written. Address and data dependencies behave similarly.
- There is a control dependency (ctrl) from a read to a memory write if there is such a dataflow path to the test of a conditional branch that is a program-order-predecessor of the write. We also refer to control dependencies from a read to a read, but ordering of the reads in that case is not respected in general.
- There is a control+isync dependency (ctrlisync) from a read to another memory read if there is such a dataflow path from the first read to the test of a conditional branch that program-order-precedes an isync instruction before the second read.

Sometimes one can use dependencies that are naturally present in an algorithm, but it can be desirable to introduce one artificially, for its ordering properties, e.g. by XOR'ing a value with itself and adding that to an address calculation.

Dependencies alone are usually not enough. For example, adding dependencies between read/read and read/write pairs, giving tests WRC+data+addr (with a data dependency on Thread 1 and an address dependency on Thread 2), and IRIW+addrs (with address dependencies on Threads 1 and 3), leaves the non-SC behaviour allowed. One cannot add dependencies to SB, as that only has write/read pairs, and one can only add a dependency to the read/read side of MP, leaving the writes unconstrained and the non-SC behaviour still allowed.

In combination with a barrier, however, dependencies can be very useful. For example, MP+sync+addr is SC:

Test MP+sync+addr: Forbidden
  Thread 0: a: W[x]=1; b: W[y]=1
  Thread 1: c: R[y]=1; d: R[x]=0
  Edges: a -sync-> b; b -rf-> c; c -addr-> d; init -rf-> d

Here the barrier keeps the writes in order, as seen by any thread, and the address dependency keeps the reads in order.

Contrary to what one might expect, the combination of a thread-local reads-from edge and a dependency does not guarantee ordering of a write-write pair, as seen by another thread; the two writes can propagate in either order (here [x]=z initially):

Test MP+nondep+sync: Allowed
  Thread 0: a: W[x]=y; b: R[x]=y; c: W[y]=1
  Thread 1: d: R[y]=1; e: R[x]=z
  Edges: a -rf-> b; b -addr-> c; c -rf-> d; d -sync-> e; init -rf-> e

Control dependencies, observable speculative reads, and isync. Recall that control dependencies (without isync) are only respected from reads to writes, not from reads to reads. If one replaces the address dependency in MP+sync+addr by a dataflow path to a conditional branch before the second read (giving the test named MP+sync+ctrl below), that does not ensure that the reads on Thread 1 bind their values in program order.
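The dependency kinds defined in this section are all properties of register dataflow within one thread, so they can be computed by a single pass over the instructions in program order. The following is a minimal sketch under simplifying assumptions: the Instr record, its field names and the dependencies function are hypothetical, a thread is a straight-line list of instructions, and every branch is treated as a conditional branch that precedes everything after it.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    kind: str          # "load", "store", "alu", "branch" or "isync"
    dst: str = ""      # register written by a load or ALU operation
    addr: tuple = ()   # registers feeding the address of a load or store
    data: tuple = ()   # registers feeding a stored value, ALU result or branch test


def dependencies(prog):
    """Return a set of (read_index, later_index, kind) triples for one thread."""
    taint = {}          # register -> indices of loads whose value flows into it
    ctrl = set()        # loads that feed the test of an earlier branch
    ctrl_isync = set()  # the subset of ctrl whose branch is followed by an isync
    deps = set()
    for i, ins in enumerate(prog):
        from_addr = set().union(*(taint.get(r, set()) for r in ins.addr))
        from_data = set().union(*(taint.get(r, set()) for r in ins.data))
        if ins.kind in ("load", "store"):
            deps |= {(j, i, "addr") for j in from_addr}
        if ins.kind == "store":
            deps |= {(j, i, "data") for j in from_data}
            deps |= {(j, i, "ctrl") for j in ctrl}
        elif ins.kind == "load":
            deps |= {(j, i, "ctrlisync") for j in ctrl_isync}
            # read-to-read control dependencies are recorded but, per the
            # text, are not respected on their own
            deps |= {(j, i, "ctrl") for j in ctrl - ctrl_isync}
            taint[ins.dst] = {i}   # a load starts a fresh dataflow path
        elif ins.kind == "alu":
            taint[ins.dst] = from_data
        elif ins.kind == "branch":
            ctrl |= from_data
        elif ins.kind == "isync":
            ctrl_isync |= ctrl
    return deps


# Artificial address dependency: XOR a loaded value with itself and add the
# (always zero) result into the address of the next load.
xor_trick = [
    Instr("load", dst="r1", addr=("ry",)),
    Instr("alu", dst="r3", data=("r1", "r1")),
    Instr("load", dst="r4", addr=("rx", "r3")),
]

# A branch on the loaded value, with and without an isync before the second load.
ctrl_only = [Instr("load", dst="r1", addr=("ry",)), Instr("branch", data=("r1",)),
             Instr("load", dst="r4", addr=("rx",))]
ctrl_isync = [Instr("load", dst="r1", addr=("ry",)), Instr("branch", data=("r1",)),
              Instr("isync"), Instr("load", dst="r4", addr=("rx",))]
```

The XOR example yields an addr dependency even though the computed offset is always zero, which is exactly why the trick is useful: the dependency is determined by the dataflow path, not by the value.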
Test MP+sync+ctrl: Allowed
  Thread 0: a: W[x]=1; b: W[y]=1
  Thread 1: c: R[y]=1; d: R[x]=0
  Edges: a -sync-> b; b -rf-> c; c -ctrl-> d; init -rf-> d

Adding an isync instruction between the branch and the second read (giving test MP+sync+ctrlisync) suffices.

The fact that data/address dependencies to both reads and writes are respected while control dependencies are only respected to writes is important in the design of C++0x low-level atomics [BA08, BOS+11], where release/consume atomics let one take advantage of data dependencies without requiring barriers (and limiting optimisation) to ensure that all source-language control dependencies are respected.

Cumulativity. For WRC it suffices to have a sync on Thread 1 with a dependency on Thread 2; the non-SC behaviour is then forbidden:

Test WRC+sync+addr: Forbidden
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1; c: W[y]=1
  Thread 2: d: R[y]=1; e: R[x]=0
  Edges: a -rf-> b; b -sync-> c; c -rf-> d; d -addr-> e; init -rf-> e

This illustrates what we call A-cumulativity of Power barriers: a chain of edges before the barrier that is respected. In this case Thread 1 reads from the Thread 0 write before (in program order) executing a sync, and then Thread 1 writes to another location; any other thread (here 2) is guaranteed to see the Thread 0 write before the Thread 1 write. However, swapping the sync and dependency, e.g. with just an rf and data edge between writes a and c, does not guarantee ordering of those two writes as seen by another thread:

Test WRC+data+sync: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1; c: W[y]=1
  Thread 2: d: R[y]=1; e: R[x]=0
  Edges: a -rf-> b; b -data-> c; c -rf-> d; d -sync-> e; init -rf-> e

In contrast to that WRC+data+sync, a chain of reads-from edges and dependencies after a sync does ensure that ordering between a write before the sync and a write after the sync is respected, as below. Here the reads e and f of z and x cannot see the writes a and d out of order. We call this a B-cumulativity property.
Test ISA2+sync+data+addr: Forbidden
  Thread 0: a: W[x]=1; b: W[y]=1
  Thread 1: c: R[y]=1; d: W[z]=1
  Thread 2: e: R[z]=1; f: R[x]=0
  Edges: a -sync-> b; b -rf-> c; c -data-> d; d -rf-> e; e -addr-> f; init -rf-> f

Using lwsync. The lwsync barrier is broadly similar to sync, including cumulativity properties, except that it does not order store/load pairs and it is cheaper to execute; it suffices to guarantee SC behaviour in MP+lwsyncs (MP with lwsync in each thread), WRC+lwsync+addr (WRC with lwsync on Thread 1 and an address dependency on Thread 2), and ISA2+lwsync+data+addr, while the non-SC behaviour of SB+lwsyncs and IRIW+lwsyncs is still allowed. We return later to other differences between sync and lwsync.

3. The Model Design

We describe the high-level design of our model in this section, giving the details in the next. We build our model as a composition of a set of (hardware) threads and a single storage subsystem, synchronising on various messages:

[Figure: Thread boxes connected to a single Storage Subsystem box; the message labels that survive extraction are Write request, Read request and Barrier request.]

Read-request/read-response pairs are tightly coupled, while the others are single unidirectional messages. There is no buffering between the two parts.

Coherence-by-fiat. Our storage subsystem abstracts completely from the processor implementation store-buffering and cache hierarchy, and from the cache protocol: our model has no explicit memory, either of the system as a whole, or of any cache or store queue (the fact that one can abstract from all these is itself interesting). Instead, we work in terms of the write events that a read can read from. Our storage subsystem maintains, for each address, the current constraint on the coherence order among the writes it has seen to that address, as a strict partial order (transitive but irreflexive). For example, suppose the storage subsystem has seen four writes, w0, w1, w2 and w3, all to the same address. It might have built up the coherence constraint on the left below, with w0 known to be before w1, w2 and w3, and w1 known to be before w2, but with as-yet-undetermined relationships between w1 and w3, and between w2 and w3.
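The coherence constraint in this example can be sketched as a small data structure that stores a strict partial order and keeps it transitively closed as commitments are added. The class name CoherenceConstraint and its methods are illustrative only; the sketch covers just the partial-order bookkeeping for one address and none of the model's other conditions.

```python
class CoherenceConstraint:
    """Strict partial order over the writes seen to one address."""

    def __init__(self, writes):
        self.writes = set(writes)
        self.before = set()  # (w1, w2) means w1 is committed coherence-before w2

    def related(self, w1, w2):
        """True if the relative order of w1 and w2 is already determined."""
        return (w1, w2) in self.before or (w2, w1) in self.before

    def commit(self, w1, w2):
        """Commit to w1 before w2, adding every edge implied by transitivity."""
        if w1 == w2 or (w2, w1) in self.before:
            raise ValueError("commitment would break irreflexivity")
        preds = {a for (a, b) in self.before if b == w1} | {w1}
        succs = {b for (a, b) in self.before if a == w2} | {w2}
        self.before |= {(a, b) for a in preds for b in succs}


# Build the constraint described in the text: w0 before w1, w2 and w3,
# and w1 before w2.
c = CoherenceConstraint(["w0", "w1", "w2", "w3"])
c.commit("w0", "w1")
c.commit("w1", "w2")   # also records w0 before w2 by transitivity
c.commit("w0", "w3")

undetermined = [(a, b) for (a, b) in [("w1", "w3"), ("w2", "w3")] if not c.related(a, b)]
```

At this point the two pairs named in the text are still unrelated, so either order remains available to a later commitment.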
[Figure: two coherence-constraint graphs over the writes w0, w1, w2 and w3, one on the left and one on the right.]

The storage subsystem also records the list of writes that it has propagated to each thread: those sent in response to read-requests, those done by the thread itself, and those propagated to that thread in the process of propagating a barrier to that thread. These are interleaved with records of barriers propagated to that thread. Note that this is a storage-subsystem-model concept: the writes propagated to a thread have not necessarily been sent to the thread model in a read-response.

Now, given a read request by a thread tid, what writes could be sent in response? From the state on the left above, if the writes propagated to thread tid are just [w1], perhaps because tid has read from w1, then:
- it cannot be sent w0, as w0 is coherence-before the w1 write that (because it is in the writes-propagated list) it might have read from;
- it could re-read from w1, leaving the coherence constraint unchanged;
- it could be sent w2, again leaving the coherence constraint unchanged, in which case w2 must be appended to the events propagated to tid; or
- it could be sent w3, again appending this to the events propagated to tid, which moreover entails committing to w3 being coherence-after w1, as in the coherence constraint on the right above. Note that this still leaves the relative order of w2 and w3 unconstrained, so another thread could be sent w2 then w3 or (in a different run) the other way around (or indeed just one, or neither).

In the model this behaviour is split up into separate storage-subsystem transitions: there is one rule for making a new coherence commitment between two hitherto-unrelated writes to the same address, one rule for propagating a write to a new thread (which can only fire after sufficient coherence commitments have been made), and one rule for returning a read value to a thread in response to a read-request. The last always returns the most recent write (to the read address) in the list of events propagated to the reading thread, which therefore serves essentially as a per-thread memory (though it records more information than just an array of bytes). We adopt these separate transitions (in what we term a partial coherence commitment style) to make it easy to relate model transitions to actual hardware implementation events: coherence commitments correspond to writes flowing through join points in a hierarchical-store-buffer implementation.

Out-of-order and Speculative Execution. As we shall see in §6, many of the observable subtleties of Power behaviour arise from the fact that the threads can perform read instructions out-of-order and speculatively, subject to an instruction being restarted if a conflicting write comes in before the instruction is committed, or being aborted if a branch turns out to be mispredicted. However, writes are not sent to the storage subsystem before their instructions are committed, and we do not see observable value speculation. Accordingly, our thread model permits very liberal out-of-order execution, with unbounded speculative execution past as-yet-unresolved branches and past some barriers, while our storage subsystem model need not be concerned with speculation, retry, or aborts. On the other hand, the storage subsystem maintains the current coherence constraint, as above, while the thread model does not need to have access to this; the thread model plays its part in maintaining coherence by issuing requests in reasonable orders, and in aborting/retrying as necessary.

For each thread we have a tree of the committed and in-flight instruction instances. Newly fetched instructions become in-flight, and later, subject to appropriate preconditions, can be committed. For example, below we show a set of instruction instances {i1, ..., i13} with the program-order-successor relation among them. Three of those (i1, i3, i4, boxed) have been committed; the remainder are in-flight.
[Figure: tree of the instruction instances i1 to i13 under the program-order-successor relation, branching at i5 and i9; i1, i3 and i4 are boxed.]

Instruction instances i5 and i9 are branches for which the thread has fetched multiple possible successors; here just two, but for a branch with a computed address it might fetch many possible successors. Note that the committed instances are not necessarily contiguous: here i3 and i4 have been committed even though i2 has not, which can only happen if they are sufficiently independent. When a branch is committed then any un-taken alternative paths are discarded, and instructions that follow (in program order) an uncommitted branch cannot be committed until that branch is, so the tree must be linear before any committed (boxed) instructions.

In implementations, reads are retried when cache-line invalidates are processed. In the model, to abstract from exactly when this happens (and from whatever tracking the core does of which instructions must be retried when it does), we adopt a late invalidate semantics, retrying appropriate reads (and their dependencies) when a read or write is committed. For example, consider two reads r1 and r2 in program order that have been satisfied from two different writes to the same address, with r2 satisfied first (out-of-order), from w1, and r1 satisfied later from the coherence-later w2. When r1 is committed, r2 must be restarted, otherwise there would be a coherence violation. (This relies on the fact that writes are never provided by the storage subsystem out of coherence order; the thread model does not need to record the coherence order explicitly.)

Dependencies are dealt with entirely by the thread model, in terms of the registers read and written by each instruction instance (the register footprints of instructions are known statically). Memory reads cannot be satisfied until their addresses are determined (though perhaps still subject to change on retry), and memory writes cannot be committed until their addresses and values are fully determined. We do not model register renaming and shadow registers explicitly, but our out-of-order execution model includes their effect, as register reads take values from program-order-preceding register writes.

Barriers (sync and lwsync) and cumulativity-by-fiat. The semantics of barriers involves both parts of the model, as follows.

When the storage subsystem receives a barrier request, it records the barrier as propagated to its own thread, marking a point in the sequence of writes that have been propagated to that thread. Those writes are the Group A writes for this barrier. When all the Group A writes (or some coherence-successors thereof) of a barrier have been propagated to another thread, the storage subsystem can record that fact also, propagating the barrier to that thread (thereby marking a point in the sequence of writes that have been propagated to that thread). A write cannot be propagated to a thread tid until all relevant barriers are propagated to tid, where the relevant barriers are those that were propagated to the writing thread before the write itself. In turn (by the above), that means that the Group A writes of those barriers (or some coherence successors) must already have been propagated to tid. This models the effect of cumulativity while abstracting from the details of how it is implemented.

Moreover, a sync barrier can be acknowledged back to the originating thread when all of its Group A writes have been propagated to all threads.

In the thread model, barriers constrain the commit order. For example, no memory load or store instruction can be committed until all previous sync barriers are committed and acknowledged; and sync and lwsync barriers cannot be committed until all previous memory reads and writes have been. Moreover, memory reads cannot be satisfied until previous sync barriers are committed and acknowledged. There are various possible modelling choices here which should not make any observable difference; the above corresponds to a moderately aggressive implementation.

4. The Model in Detail

We now detail the interface between the storage subsystem and thread models, and the states and transitions of each. The transitions are described in §4.3 and §4.5 in terms of their precondition, their effect on the relevant state, and the messages sent or received. Transitions are atomic, and synchronise as shown in Fig. 1; messages are not buffered. This is a prose description of our mathematical definitions, available on-line [SSA+11].

4.1 The Storage Subsystem/Thread Interface

The storage subsystem and threads exchange messages:
- a write request (or write) w specifies the writing thread tid, unique request id eid, address a, and value v.
- a read request specifies the originating thread tid, request id eid, and address a.
- a read response specifies the originating thread tid, request id eid, and a write w (itself specifying the thread tid' that did the write, its id eid', address a, and value v). This is sent when the value read is bound.
- a barrier request specifies the originating thread tid, request id eid, and barrier type b (sync or lwsync).
- a barrier ack specifies the originating thread tid and request id eid (a barrier ack is only sent for sync barriers, after the barrier is propagated to all threads).

4.2 Storage Subsystem States

A storage subsystem state s has the following components.
- s.threads is the set of thread ids that exist in the system.
- s.writes_seen is the set of all writes that the storage subsystem has seen.
- s.coherence is the current constraint on the coherence order, among the writes that the storage subsystem has seen. It is a binary relation: s.coherence contains the pair (w1, w2) if the storage subsystem has committed to write w1 being before write w2 in the coherence order. This relation grows over time, with new pairs being added, as the storage subsystem makes additional commitments. For each address, s.coherence is a strict partial order over the write requests seen to that address. It does not relate writes to different addresses, or relate any write that has not been seen by the storage subsystem to any write.
- s.events_propagated_to gives, for each thread, a list of:
  1. all writes done by that thread itself,
  2. all writes by other threads that have been propagated to this thread, and
  3. all barriers that have been propagated to this thread.
  We refer to those writes as the writes that have been propagated to that thread. The Group A writes for a barrier are all the writes that have been propagated to the barrier's thread before the barrier is.
- s.unacknowledged_sync_requests is the set of sync barrier requests that the storage subsystem has not yet acknowledged.

An initial state for the storage subsystem has the set of thread ids that exist in the system, exactly one write for each memory address, all of which have been propagated to all threads (this ensures that they will be coherence-before any other write to that address), an empty coherence order, and no unacknowledged sync requests.

4.3 Storage Subsystem Transitions

Accept write request. A write request by a thread tid can always be accepted. Action:
1. add the new write to s.writes_seen, to record the new write as seen by the storage subsystem;
2. append the new write to s.events_propagated_to(tid), to record the new write as propagated to its own thread; and
3. update s.coherence to note that the new write is coherence-after all writes (to the same address) that have previously propagated to this thread.

Partial coherence commitment. The storage subsystem can internally commit to a more constrained coherence order for a particular address, adding an arbitrary edge (between a pair of writes to that address that have been seen already and that are not yet related by coherence) to s.coherence, together with any edges implied by transitivity, if there is no cycle in the union of the resulting coherence order and the set of all pairs of writes (w1, w2), to any address, for which w1 and w2 are separated by a barrier in the list of events propagated to the thread of w2. Action: add the new edges to s.coherence.

Propagate write to another thread. The storage subsystem can propagate a write w (by thread tid) that it has seen to another thread tid', if:
1. the write has not yet been propagated to tid';
2. w is coherence-after any write to the same address that has already been propagated to tid'; and
3. all barriers that were propagated to tid before w (in s.events_propagated_to(tid)) have already been propagated to tid'.
Action: append w to s.events_propagated_to(tid').

Send a read response to a thread. The storage subsystem can accept a read-request by a thread tid at any time, and reply with the most recent write w to the same address that has been propagated to tid. The request and response are tightly coupled into one atomic transition. Action: send a read-response message containing w to tid.

Accept barrier request. A barrier request from a thread tid can always be accepted. Action:
1. append it to s.events_propagated_to(tid), to record the barrier as propagated to its own thread (and thereby fix the set of Group A writes for this barrier); and
2. (for sync) add it to s.unacknowledged_sync_requests.

Propagate barrier to another thread. The storage subsystem can propagate a barrier it has seen to another thread if:
1. the barrier has not yet been propagated to that thread; and
2. for each Group A write, that write (or some coherence successor) has already been propagated to that thread.
Action: append the barrier to s.events_propagated_to(tid), where tid is the thread being propagated to.

Acknowledge sync barrier. A sync barrier b can be acknowledged if it has been propagated to all threads. Action:
1. send a barrier-ack message to the originating thread; and
2. remove b from s.unacknowledged_sync_requests.
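The storage-subsystem rules above translate fairly directly into code. The following is a minimal sketch, assuming a much-simplified reading of the rules: it omits the thread model entirely, omits the barrier-related cycle check in the partial coherence commitment rule, and does not send messages. The class StorageSubsystem, the records Write and Barrier, the "init" thread id for initial writes, and the helper mp_write_side are all hypothetical names chosen for illustration.

```python
from collections import namedtuple

Write = namedtuple("Write", "tid eid addr value")
Barrier = namedtuple("Barrier", "tid eid kind")   # kind is "sync" or "lwsync"


class StorageSubsystem:
    def __init__(self, threads, addresses):
        init = [Write("init", i, a, 0) for i, a in enumerate(addresses)]
        self.threads = list(threads)
        self.writes_seen = set(init)
        self.coherence = set()                    # committed (w1, w2) pairs
        self.propagated = {t: list(init) for t in self.threads}
        self.unacked_syncs = set()

    def _writes_to(self, tid, addr):
        return [e for e in self.propagated[tid]
                if isinstance(e, Write) and e.addr == addr]

    def commit(self, w1, w2):
        """Partial coherence commitment (barrier-related cycle check omitted)."""
        assert w1.addr == w2.addr and w1 != w2 and (w2, w1) not in self.coherence
        preds = {a for (a, b) in self.coherence if b == w1} | {w1}
        succs = {b for (a, b) in self.coherence if a == w2} | {w2}
        self.coherence |= {(a, b) for a in preds for b in succs}

    def accept_write(self, w):
        for old in self._writes_to(w.tid, w.addr):
            self.commit(old, w)                   # new write is coherence-after these
        self.writes_seen.add(w)
        self.propagated[w.tid].append(w)

    def can_propagate_write(self, w, to):
        mine = self.propagated[w.tid]
        barriers_before = [e for e in mine[:mine.index(w)] if isinstance(e, Barrier)]
        return (w not in self.propagated[to]
                and all((old, w) in self.coherence for old in self._writes_to(to, w.addr))
                and all(b in self.propagated[to] for b in barriers_before))

    def propagate_write(self, w, to):
        assert self.can_propagate_write(w, to)
        self.propagated[to].append(w)

    def read(self, tid, addr):
        """Most recent write to addr among the events propagated to tid."""
        return self._writes_to(tid, addr)[-1]

    def accept_barrier(self, b):
        self.propagated[b.tid].append(b)
        if b.kind == "sync":
            self.unacked_syncs.add(b)

    def group_a(self, b):
        mine = self.propagated[b.tid]
        return [e for e in mine[:mine.index(b)] if isinstance(e, Write)]

    def can_propagate_barrier(self, b, to):
        there = self.propagated[to]

        def arrived(w):                           # w itself or a coherence successor
            return w in there or any((w, s) in self.coherence for s in there)

        return b not in there and all(arrived(w) for w in self.group_a(b))

    def propagate_barrier(self, b, to):
        assert self.can_propagate_barrier(b, to)
        self.propagated[to].append(b)

    def acknowledge_sync(self, b):
        assert all(b in self.propagated[t] for t in self.threads)
        self.unacked_syncs.discard(b)


def mp_write_side(with_sync):
    """Write side of MP: can the flag write y reach Thread 1 before the data write x?"""
    s = StorageSubsystem(threads=[0, 1], addresses=["x", "y"])
    wx, wy = Write(0, "a", "x", 1), Write(0, "b", "y", 1)
    s.accept_write(wx)
    if with_sync:
        s.accept_barrier(Barrier(0, "s", "sync"))
    s.accept_write(wy)
    return s.can_propagate_write(wy, 1)
```

Without a barrier the flag write can be propagated to Thread 1 first, so a read of x there can still return the initial write. With a sync between the two writes, the flag write is blocked until the barrier has been propagated, which in turn requires the data write (a Group A write) to have been propagated. This covers only the storage-subsystem half of the MP tests; the ordering of the two reads on Thread 1 is a matter for the thread model.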
Storage Subsystem Rule               Message(s)                    Thread Rule
-----------------------------------  ----------------------------  ----------------------------------------------------
Accept write request                 write request                 Commit in-flight instruction
Partial coherence commitment
Propagate write to another thread
Send a read response to a thread     read request/read response    Satisfy memory read from storage subsystem
                                                                   Satisfy memory read by forwarding an in-flight write
Accept barrier request               barrier request               Commit in-flight instruction
Propagate barrier to another thread
Acknowledge sync barrier             barrier ack                   Accept sync barrier acknowledgement
                                                                   Fetch instruction
                                                                   Register read from previous register write
                                                                   Register read from initial register state
                                                                   Internal computation step

Figure 1. Storage Subsystem and Thread Synchronisation

4.4 Thread States

The state t of a single hardware thread consists of:
- its thread id.
- the initial values for all registers, t.initial_register_state.
- a set t.committed_instructions of committed instruction instances. All their operations have been executed and they are not subject to restart or abort.
- a set t.in_flight_instructions of in-flight instruction instances. These have been fetched and some of the associated instruction-semantics micro-operations may have been executed. However, none of the associated writes or barriers have been sent to the storage subsystem, and any in-flight instruction is subject to being aborted (together with all of its dependents).
- a set t.unacknowledged_syncs of sync barriers that have not been acknowledged by the storage subsystem.

An initial state for a thread has no committed or in-flight instructions and no unacknowledged sync barriers.

Each instruction instance i consists of a unique id, a representation of the current state of its instruction semantics, the names of its input and output registers, the set of writes that it has read from, the instruction address, the program-order-previous instruction instance id, and a value constraint required to reach this instruction instance from the previous instance. The instruction semantics executes in steps, doing internal computation, register reads and writes, memory reads, and, finally, memory writes or barriers.

4.5 Thread Transitions

Fetch instruction. An instruction inst can be fetched, following its program-order predecessor iprev and from address a, if
1. a is a possible next fetch address for iprev; and
2. inst is the instruction of the program at a.
The possible next fetch addresses allow speculation past calculated jumps and conditional branches; they are defined as:
1. for a non-branch/jump instruction, the successor instruction address;
2. for a jump to a constant address, that address;
3. for a jump to an address which is not yet fully determined (i.e., where there is an uncommitted instruction with a dataflow path to the address), any address; and
4. for a conditional branch, the possible addresses for a jump together with the successor.
Action: construct an initialized instruction instance and add it to the set of in-flight instructions. This is an internal action of the thread, not involving the storage subsystem, as we assume a fixed program rather than modelling fetches with reads; we do not model self-modifying code.

Commit in-flight instruction. An in-flight instruction can be committed if:
1. its instruction semantics has no pending reads (memory or register) or internal computation (data or address arithmetic);
2. all instructions with a dataflow dependency to this instruction (instructions with register outputs feeding to this instruction's register inputs) are committed;
3. all program-order-previous branches are committed;
4. if a memory load or store is involved, all program-order-previous instructions which might access its address (i.e., which have an as-yet-undetermined address or which have a determined address which equals that one) are committed;
5. if a memory load or store is involved, or this instruction is a sync, lwsync, or isync, then
   (a) all previous sync, lwsync and isync instructions are committed, and
   (b) there is no unacknowledged sync barrier by this thread;
6. if a sync or lwsync instruction, all previous memory access instructions are committed;
7. if an isync, then all program-order-previous instructions which access memory have their addresses fully determined, where by 'fully determined' we mean that all instructions that are the source of incoming dataflow dependencies to the relevant address are committed and any internal address computation is done.
Action: note that the instruction is now committed, and:
1. if a write instruction, restart any in-flight memory reads (and their dataflow dependents) that have read from the same address, but from a different write (and where the read could not have been by forwarding an in-flight write);
2. if a read instruction, find all in-flight program-order successors that have read from a different write to the same address, or which follow a lwsync barrier program-order after this instruction, and restart them and their dataflow dependents;
3. if this is a branch, abort any untaken speculative paths of execution, i.e., any instruction instances that are not reachable by the branch taken; and
4. send any write requests or barrier requests as required by the instruction semantics.

Accept sync barrier acknowledgement. A sync barrier acknowledgement can always be accepted (there will always be a committed sync whose barrier has a matching eid). Action: remove the corresponding barrier from t.unacknowledged_syncs.

Satisfy memory read from storage subsystem. A pending read request in the instruction semantics of an in-flight instruction can be satisfied by making a read-request and getting a read-response containing a write from the storage subsystem if:
1. the address to read is determined (i.e., any other reads with a dataflow path to the address have been satisfied, though not necessarily committed, and any arithmetic on such a path completed);
2. all program-order-previous syncs are committed and acknowledged; and
3. all program-order-previous isyncs are committed.
Action:
1. update the internal state of the reading instruction; and
2. note that the write has been read from by that instruction.

The remaining transitions are all thread-internal steps.

Satisfy memory read by forwarding an in-flight write directly to reading instruction. A pending memory write w from an in-flight (uncommitted) instruction can be forwarded directly to a read of an instruction i if
1. w is an uncommitted write to the same address that is program-order before the read, and there is no program-order-intervening memory write that might be to the same address;
2. all i's program-order-previous syncs are committed and acknowledged; and
3. all i's program-order-previous isyncs are committed.
Action: as in the satisfy memory read from storage subsystem rule above.

Register read from previous register write. A register read can read from a program-order-previous register write if the latter is the last write to the same register program-order before it. Action: update the internal state of the in-flight reading instruction.

Register read from initial register state. A register read can read from the initial register state if there is no write to the same register program-order before it. Action: update the internal state of the in-flight reading instruction.

Internal computation step. An in-flight instruction can perform an internal computation step if its semantics has a pending internal transition, e.g. for an arithmetic operation. Action: update the internal state of the in-flight instruction.

4.6 Final states

The final states are those with no transitions. It should be the case that for all such, all instruction instances are committed.

5. Explaining the simple examples

The abstract machine explains the allowed and forbidden behaviour for all the simple tests we saw before. For example, in outline:

MP. The Thread 0 write-requests for x and y could be in-order or not, but either way, because they are to different addresses, they can be propagated to Thread 1 in either order. Moreover, even if they are propagated in program order, the Thread 1 read of x can be satisfied
first (seeing the initial state), then the read of y, and they could be committed in either order.

MP+sync+ctrl (control dependency). Here the sync keeps the propagation of the writes to Thread 1 in order, but the Thread 1 read of x can be satisfied speculatively, before the conditional branch of the control dependency is resolved and before the program-order-preceding Thread 1 read of y is satisfied; then the two reads can be committed in program order.

MP+sync+ctrlisync (isync). Adding isync between the conditional branch and the Thread 1 read of x prevents that read being satisfied until the isync is committed, which cannot happen until the program-order-previous branch is committed, which cannot happen until the first read is satisfied and committed.

WRC+sync+addr (A-cumulativity). The Thread 0 write-request for a:W[x]=1 must be made, and the write propagated to Thread 1, for b to read 1 from it. Thread 1 then makes a barrier request for its sync, and that is propagated to Thread 1 after a (so the write a is in the Group A set for this barrier), before making the write-request for c:W[y]=1. That write must be propagated to Thread 2 for d to read from it, but before that is possible the sync must be propagated to Thread 2, and before that is possible a must be propagated to Thread 2. Meanwhile, the dependency on Thread 2 means that the address of the read e is not known, and so e cannot be satisfied, until read d has been satisfied (from c). As that cannot be until after a is propagated to Thread 2, read e must read 1 from a, not 0 from the initial state.

WRC+data+sync. Here, in contrast, while the Thread 0/Thread 1 reads-from relationship and the Thread 1 dependency ensure that the write-requests for a:W[x]=1 and c:W[y]=1 are made in that order, and the Thread 2 sync keeps its reads in order, the order that the writes are propagated to Thread 2 is unconstrained.

ISA2 (B-cumulativity). In the ISA2+sync+data+addr B-cumulativity example, the Thread 0 write requests and barrier request must reach the storage subsystem in program order, so Group A for the sync is {a} and the sync is propagated to Thread 0 before the b write request reaches the storage subsystem. For c to read from b, the latter must have been propagated to Thread 1, which requires the sync to be propagated to Thread 1, which in turn requires the Group A write a to have been propagated to Thread 1. Now, the Thread 1 dependency means that d cannot be committed before the read c is satisfied (and committed), and hence d must be after the sync is propagated to Thread 1. Finally, for e to read from d, the latter must have been propagated to Thread 2, for which the sync must be propagated to Thread 2, and hence the Group A write a propagated to Thread 2. The Thread 2 dependency means that f cannot be satisfied until e is, so it must read from a, not from the initial state. The same result and reasoning hold for the lwsync variant of this test (note that the reasoning did not involve sync acks or any memory reads program-order-after the sync).

IRIW+syncs. Here the two syncs (on Threads 1 and 3) have the corresponding writes (a and d) in their Group A sets, and hence those writes must be propagated to all threads before the respective syncs are acknowledged, which must happen before the program-order-subsequent reads c and f can be satisfied. But for those to read 0, from coherence-predecessors of a and d, the latter must not have been propagated to all threads (in particular, they must not have been propagated to Threads 3 and 1 respectively). In other words, for this to happen there would have to be a cycle in abstract-machine execution time:

  a:W[x]=1 propagated to last thread
  -> Thread 1 sync acknowledgement
  -> Thread 1 c:R[y]=0 is satisfied
  -> d:W[y]=1 propagated to last thread
  -> Thread 3 sync acknowledgement
  -> Thread 3 f:R[x]=0 is satisfied
  -> a:W[x]=1 propagated to last thread (closing the cycle)
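The IRIW+syncs argument amounts to checking a small "must happen before" graph for a cycle. The following is a minimal sketch of that check, not part of the model definition: the event labels, the edge lists, and the helper has_cycle are illustrative names, and the lwsync variant is approximated (as an assumption) by simply dropping the two "acknowledgement before read is satisfied" edges.

```python
# Each edge (u, v) reads "u must precede v in abstract-machine execution time".
# Labels follow the IRIW+syncs discussion; all names here are illustrative.
A_PROP = "a:W[x]=1 propagated to last thread"
T1_ACK = "Thread 1 sync acknowledgement"
C_SAT = "Thread 1 c:R[y]=0 satisfied"
D_PROP = "d:W[y]=1 propagated to last thread"
T3_ACK = "Thread 3 sync acknowledgement"
F_SAT = "Thread 3 f:R[x]=0 satisfied"

SYNC_EDGES = [
    (A_PROP, T1_ACK),  # a is in Group A of the Thread 1 sync
    (T1_ACK, C_SAT),   # reads wait for earlier syncs to be acknowledged
    (C_SAT, D_PROP),   # c read 0, so d had not yet reached every thread
    (D_PROP, T3_ACK),  # d is in Group A of the Thread 3 sync
    (T3_ACK, F_SAT),
    (F_SAT, A_PROP),   # f read 0, so a had not yet reached every thread
]

# Assumption: with lwsyncs, reads do not wait for any acknowledgement,
# so the two ack -> read edges are simply absent.
LWSYNC_EDGES = [e for e in SYNC_EDGES
                if e not in {(T1_ACK, C_SAT), (T3_ACK, F_SAT)}]


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edge pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:
                return True  # back edge: cycle found
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


sync_is_cyclic = has_cycle(SYNC_EDGES)
lwsync_is_cyclic = has_cycle(LWSYNC_EDGES)
```

Under these assumptions the sync constraints are cyclic (so the outcome cannot occur), while the lwsync constraints are not.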
With lwsyncs instead of syncs, the behaviour is allowed, because lwsyncs do not have an analogous acknowledgement when their Group A writes have been propagated to all threads, and memory reads do not wait for previous lwsyncs to reach that point.

6. Not-so-simple examples

We now discuss some more subtle behaviours, explaining each in terms of our model.

Write forwarding. In the PPOCA variant of MP below, f is address-dependent on e, which reads from the write d, which is control-dependent on c. One might expect that chain to prevent read f binding its value (with the satisfy memory read from storage subsystem rule) before c does, but in fact in some implementations f can bind out-of-order, as shown: the write d can be forwarded directly to e within the thread, before the write is committed to the storage subsystem. The satisfy memory read by forwarding an in-flight write rule models this. Replacing the control dependency with a data dependency (test PPOAA, not shown) removes that possibility, forbidding the given result on current hardware, as far as our experimental results show, and in our model. The current architecture text [Pow09] leaves the PPOAA outcome unspecified, but we anticipate that future versions will explicitly forbid it.
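The precondition of the forwarding rule from Section 4.5, which PPOCA exercises, can be sketched as a scan over a thread's in-flight instructions in program order. This is an illustrative reading of the prose rule, not the Lem definition: the Instr record, its field names, and forwardable_write are hypothetical, and the sketch assumes the reading instruction's own address is already determined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    kind: str                   # "write", "read", "sync", "isync" or "other"
    addr: Optional[str] = None  # None models a not-yet-determined address
    committed: bool = False
    acked: bool = False         # only meaningful for a sync


def forwardable_write(po: List[Instr], i: int) -> Optional[int]:
    """Index of the in-flight write that may be forwarded to read po[i], if any."""
    read = po[i]
    if read.kind != "read" or read.addr is None:
        return None
    # Conditions 2 and 3: earlier syncs committed and acknowledged,
    # earlier isyncs committed.
    for prev in po[:i]:
        if prev.kind == "sync" and not (prev.committed and prev.acked):
            return None
        if prev.kind == "isync" and not prev.committed:
            return None
    # Condition 1: the nearest earlier write that might be to the same
    # address must be an uncommitted write to exactly that address.
    for j in range(i - 1, -1, -1):
        w = po[j]
        if w.kind != "write":
            continue
        if w.addr is None:
            return None  # an intervening write might be to the same address
        if w.addr == read.addr:
            return None if w.committed else j
    return None


# Thread 1 of PPOCA: c:R[y]; conditional branch; d:W[x]; e:R[x]
ppoca_thread1 = [Instr("read", "y"), Instr("other"),
                 Instr("write", "x"), Instr("read", "x")]
forwarded_from = forwardable_write(ppoca_thread1, 3)
```

In this sketch the uncommitted write d (index 2) is eligible to be forwarded to e even though the read c and the branch are still uncommitted; placing an uncommitted isync before e would block it.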
Test PPOCA: Allowed
  Thread 0: a:W[z]=1; sync; b:W[y]=1
  Thread 1: c:R[y]=1; ctrl; d:W[x]=1; e:R[x]=1; addr; f:R[z]=0
  Reads-from: b to c; d to e; f reads the initial state of z

Test RSW: Allowed
  Thread 0: a:W[z]=1; sync; b:W[y]=2
  Thread 1: c:R[y]=2; addr; d:R[x]=0; po; e:R[x]=0; addr; f:R[z]=0
  Reads-from: b to c; d, e and f read the initial state

Aggressively out-of-order reads. In the reads-from-same-writes (RSW) variant of MP above, the two reads of x, d and e, happen to read from the same write (the initial state). In this case, despite the fact that d and e are reading from the same address, the e/f pair can satisfy their reads out-of-order, before the c/d pair, permitting the outcome shown. The address of e is known, so it can be satisfied early, while the address of d is not known until its address dependency on c is resolved. In contrast, in an execution in which d and e read from different writes to x (test RDW, not shown, with another write to x by another thread), that is forbidden: in the model, the commit of the first read (d) would force a restart of the second (e), together with its dependencies (including f), if e had initially read from a different write to d. In actual implementations the restart might be earlier, when an invalidate is processed, but will have the same observable effect.

Coherence and lwsync: blw-w-006. This example shows that one cannot assume that the transitive closure of lwsync and coherence edges guarantees ordering of write pairs, which is a challenge for over-simplified models. In our abstract machine, the fact that the storage subsystem commits to b being before c in the coherence order has no effect on the order in which writes a and d propagate to Thread 2. Thread 1 does not read from either Thread 0 write, so they need not be sent to Thread 1, so no cumulativity is in play.
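Returning briefly to the RSW and RDW discussion above, the restart triggered when a read commits can be sketched as follows. This covers only the "read from a different write to the same address" clause of the commit action (the lwsync clause is deliberately omitted), and the dictionary layout, the write identifiers "init" and "w", and the function names are assumptions made for illustration.

```python
def restart_on_read_commit(po, i):
    """Indices of in-flight instructions to restart when read po[i] commits.

    po is a thread's instruction list in program order. Each entry is a dict
    with keys: kind, addr, read_from (identifier of the write read from, or
    None if not yet satisfied) and deps (set of indices it depends on by
    dataflow).
    """
    committed = po[i]
    restart = {
        j for j in range(i + 1, len(po))
        if po[j]["kind"] == "read"
        and po[j]["addr"] == committed["addr"]
        and po[j]["read_from"] is not None
        and po[j]["read_from"] != committed["read_from"]
    }
    # Close the set under dataflow dependents.
    changed = True
    while changed:
        changed = False
        for j, ins in enumerate(po):
            if j not in restart and restart & ins["deps"]:
                restart.add(j)
                changed = True
    return restart


def thread1(d_from, e_from):
    """Thread 1 of RSW/RDW: c:R[y]; d:R[x] (addr on c); e:R[x]; f:R[z] (addr on e)."""
    return [
        {"kind": "read", "addr": "y", "read_from": "b", "deps": set()},
        {"kind": "read", "addr": "x", "read_from": d_from, "deps": {0}},
        {"kind": "read", "addr": "x", "read_from": e_from, "deps": set()},
        {"kind": "read", "addr": "z", "read_from": "init", "deps": {2}},
    ]


rsw_restarts = restart_on_read_commit(thread1("init", "init"), 1)
rdw_restarts = restart_on_read_commit(thread1("w", "init"), 1)
```

In the RSW case nothing is restarted when d commits, so the early-satisfied e and f survive; in the RDW case e and its dependent f are restarted.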
Test blw-w-006: Allowed
  Thread 0: a:W[x]=1; lwsync; b:W[y]=1
  Thread 1: c:W[y]=2; lwsync; d:W[z]=1
  Thread 2: e:R[z]=1; addr; f:R[x]=0
  Coherence: b before c. Reads-from: d to e; f reads the initial state of x

In some implementations, and in the model, replacing both lwsyncs by syncs (bsync-w-006) forbids this behaviour. In the model, it would require a cycle in abstract-machine execution time, from the point at which a propagates to its last thread, to the Thread 0 sync ack, to the b write accept, to c propagating to Thread 0, to c propagating to its last thread, to the Thread 1 sync ack, to the d write accept, to d propagating to Thread 2, to e being satisfied, to f being satisfied, to a propagating to Thread 2, to a propagating to its last thread. The current architecture text again leaves this unspecified, but one would expect that adding sync everywhere (or, in this case, an address dependency between two reads) should regain SC.

Coherence and lwsync: 2+2W and R01. The 2+2W+lwsyncs example below is a case where the interaction of coherence and lwsyncs does forbid some behaviour. Without the lwsyncs (2+2W), the given execution is allowed. With them, the writes must be committed in program order, but after one partial coherence commitment (say d before a) is done, the other (b before c) is no longer permitted. (As this test has only writes, it may be helpful to note that the coherence order edges here could be observed either by reading the final state or with additional threads reading x twice and y twice. Testing both versions gives the same result.) This example is a challenge for axiomatic models with a view order per thread, as something is needed to break the symmetry. The given behaviour is also forbidden for the version with syncs (2+2W+syncs).
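As a point of comparison for 2+2W, a brute-force enumeration shows that the outcome in question is not reachable under sequential consistency. The sketch below does not model lwsync or the storage subsystem at all; it only interleaves the four writes in program order and collects final states. It assumes that the coherence edges "d before a" and "b before c" correspond to the final state x=1 and y=1; the names THREADS and sc_final_states are illustrative.

```python
from itertools import permutations

# 2+2W: Thread 0 does a:W[x]=1 then b:W[y]=2; Thread 1 does c:W[y]=1 then d:W[x]=2.
THREADS = {
    0: [("a", "x", 1), ("b", "y", 2)],
    1: [("c", "y", 1), ("d", "x", 2)],
}


def sc_final_states(threads):
    """Final (x, y) values over all interleavings that respect program order."""
    events = [(t, k) for t, prog in threads.items() for k in range(len(prog))]
    finals = set()
    for order in permutations(events):
        pos = {e: n for n, e in enumerate(order)}
        respects_po = all(
            pos[(t, k)] < pos[(t, k + 1)]
            for t, prog in threads.items()
            for k in range(len(prog) - 1)
        )
        if not respects_po:
            continue
        mem = {}
        for t, k in order:
            _name, loc, val = threads[t][k]
            mem[loc] = val  # the last write in the interleaving wins
        finals.add((mem["x"], mem["y"]))
    return finals


finals = sc_final_states(THREADS)
outcome_is_sc = (1, 1) in finals
```

The enumeration yields three final states, and x=1, y=1 is not among them, which is why the outcome is only of interest on a relaxed machine.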
Test 2+2W+lwsyncs: Forbidden
  Thread 0: a:W[x]=1; lwsync; b:W[y]=2
  Thread 1: c:W[y]=1; lwsync; d:W[x]=2
  Coherence: d before a; b before c

Test R01: Allowed
  Thread 0: a:W[x]=1; lwsync; b:W[y]=1
  Thread 1: c:W[y]=2; sync; d:R[x]=0
  Coherence: b before c. Reads-from: d reads the initial state of x

The R01 test above is a related case where we have not observed the given allowed behaviour in practice, but it is not currently forbidden by the architecture, and our model permits it. In the model, the writes can all reach the storage subsystem, the b/c partial coherence commitment be made, c be propagated to Thread 0, the sync be committed and acknowledged, and d be satisfied, all before a and the lwsync propagate to Thread 1.

LB and (no) thin-air reads. This LB dual of the SB example is another case where we have not observed the given allowed behaviour in practice, but it is clearly architecturally intended, so programmers should assume that future processors might permit it, and our model does. Adding data or address dependencies (e.g. in LB+datas) should forbid the given behaviour (the data dependency case could involve out-of-thin-air reads), but experimental testing here is vacuous, as LB itself is not observed.

Test LB: Allowed
  Thread 0: a:R[x]=1; po; b:W[y]=1
  Thread 1: c:R[y]=1; po; d:W[x]=1
  Reads-from: b to c; d to a

Register shadowing. Adir et al. [AAS03] give another variant of LB (which we call LB+rs, for Register Shadowing), with a dependency on Thread 1 but re-using the same register on Thread 0, to demonstrate the observability of shadow registers. That is also allowed in our model but not observable in our tests, unsurprisingly, given that we have not observed LB itself. However, the following variant of MP does exhibit observable register shadowing: the two uses of r3 on Thread 1 do not prevent the second read being satisfied out-of-order, if the reads are into shadow registers. The reuse of a register is not represented in our diagrams, so for this example we have to give the underlying PowerPC assembly code.
Test MP+sync+rs: Allowed
  Thread 0: a:W[x]=1; sync; b:W[y]=1
  Thread 1: c:R[y]=1; po; d:R[x]=0
  Reads-from: b to c; d reads the initial state of x

  Thread 0          Thread 1
  li r1,1           lwz r3,0(r2)
  stw r1,0(r2)      mr r1,r3
  sync              lwz r3,0(r4)
  li r3,1
  stw r3,0(r4)

  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Allowed: 1:r1=1 ∧ 1:r3=0

7. Experiments on hardware

The use of small litmus-test programs for discussing the behaviour of relaxed memory models is well-established, but most previous work (with notable exceptions) does not empirically investigate how tests behave on actual hardware. We use our tool [AMSS11] to run tests on machines with various Power processors: Power G5 (aka PowerPC 970MP, based on a POWER4 core), POWER5, POWER6, and POWER7. The tool takes tests in PowerPC assembly and runs them in a test harness designed to stress the processor, to increase the likelihood of interesting results. This is black-box testing, and one cannot make any definite conclusions from the absence of some observation, but our experience is that the tool is rather discriminating, identifying many issues with previous models (and [AMSS10] report the discovery of a processor erratum using it).

Our work is also unusual in the range and number of tests used. For this paper we have established a library based on tests from the literature [Pow09, BA08, AAS03, ARM08], new hand-written tests (e.g. the PPOCA, PPOAA, RSW, RDW, and 2+2W in §6, and many others), and systematic variations of several tests (SB, MP, WRC, IRIW, ISA2, LB, and two others, RWC and WWC) with all possible combinations of barriers or dependencies; we call this the "VAR3" family, of 314 tests. We ran all of these on Power G5, 6, and 7. In addition, we use the tool [AMSS10] to systematically generate several thousand interesting tests with cycles of edges (dependencies, reads-from, coherence, etc.) of increasing size, and tested some of these. As an important style point, we use tests with constraints on the final values (and hence on the values read) rather than loops, to make them easily testable.

We give an excerpt of our experimental results below, to give the flavour; more are available online [SSA+11]. For example, PPOCA was observable on POWER G5 (1.0k/3.1G), not observable on POWER6, and then observable again on POWER7, consistent with the less aggressively out-of-order microarchitecture of POWER6.
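Entries such as 1.0k/3.1G can be read as "times the outcome was observed / number of runs". The sketch below shows one way such an entry might be parsed and compared with the model's verdict, using the ok and unseen labels that appear in the results excerpt. The parsing helpers and the "unsound" label (for a Forbid outcome that was nevertheless observed) are assumptions for illustration, not part of our tooling.

```python
SUFFIX = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(text):
    """Parse a count such as '62k', '1.0k' or '0' into an integer."""
    if text[-1] in SUFFIX:
        return round(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def parse_observation(entry):
    """Split 'observed/runs' into a pair of integers."""
    seen, runs = entry.split("/")
    return parse_count(seen), parse_count(runs)


def classify(model, entry):
    """Compare the model verdict ('Allow' or 'Forbid') with an observation entry."""
    seen, _runs = parse_observation(entry)
    if model == "Allow":
        # Allowed but never seen is not an error: hardware may be stronger.
        return "ok" if seen > 0 else "unseen"
    if model == "Forbid":
        # Hypothetical label: observing a forbidden outcome would show
        # that the model is unsound for that processor.
        return "ok" if seen == 0 else "unsound"
    raise ValueError(f"unknown model verdict: {model!r}")


ppoca_power6 = classify("Allow", "0/39G")
ppoca_power7 = classify("Allow", "62k/141G")
ppoaa_power7 = classify("Forbid", "0/157G")
```

Note the asymmetry: an "unseen" result is compatible with the model, whereas a single observation of a forbidden outcome would not be.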
  Test              Model    POWER6             POWER7
  WRC               Allow    ok 970k/12G        ok 23M/93G
  WRC+data+addr     Allow    ok 562k/12G        ok 94k/93G
  WRC+syncs         Forbid   ok 0/16G           ok 0/110G
  WRC+sync+addr     Forbid   ok 0/16G           ok 0/110G
  WRC+lwsync+addr   Forbid   ok 0/16G           ok 0/110G
  WRC+data+sync     Allow    ok 150k/12G        ok 56k/94G
  PPOCA             Allow    unseen 0/39G       ok 62k/141G
  PPOAA             Forbid   ok 0/39G           ok 0/157G
  LB                Allow    unseen 0/31G       unseen 0/176G

The interplay between manual testing, systematic testing, and discussion with IBM staff has been essential to the development of our model. For example: the PPOCA/PPOAA behaviour was discovered in manual testing, leading us to conjecture that it should be explained by write-forwarding, which was later confirmed by discussion; the blw-w-006 test, found in systematic testing, highlighted difficulties with coherence and lwsync in an earlier model; and the role of coherence and sync acknowledgements in the current implementations arose from discussion.

8. Executing the model

The intricacy of relaxed memory models (and the number of tests we consider) make it essential also to have tool support for exploring the model, to automatically calculate the outcomes that the model permits for a litmus test, and to compare them against those observed in practice. To ease model development, and also to retain confidence in the tool, its kernel should be automatically generated from the model definition, not hand-coded. Our abstract machine is defined in Lem, a new lightweight language for machine-formalised mathematical definitions, of types, functions and inductive relations [OBZNS]. From this we generate HOL4 prover code (and thence an automatically typeset version of the machine) and executable OCaml code, using a finite set library, for the precondition and action of each transition rule. We also formalised a symbolic operational semantics for the tiny fragment of the instruction set needed for our tests. Using those, we build an exhaustive memoised search procedure that finds all possible abstract-machine executions for litmus tests.

This has confirmed that the model has the expected behaviour for the 41 tests we mention by name in this paper, for the rest of the VAR3 family of 314 systematic tests,
and for various other tests. In most cases the model exactly matches the Power 7 experimental results, with the exception of a few where it includes the experimental outcomes but is intentionally looser; this applies to 60 tests out of our batch of 333. Specifically: the model allows instructions to commit out of program order, which permits the LB and LB+rs test outcomes (not observed in practice); the model also allows an isync to commit even in the presence of previously uncommitted memory accesses, whereas the specified outcomes of tests such as WRC with an lwsync and isync have not been observed; and the R01 test outcome is not observed. In all these cases the model follows the architectural intent, as confirmed with IBM staff.

Our experimental results also confirm that Power G5 and 6 are strictly stronger than Power 7 (though in different ways): we have not seen any test outcome on those which is not also observable on Power 7. The model is thus also sound for those, to the best of our knowledge.

9. Related Work

There has been extensive previous work on relaxed memory models. We focus on models for the major current processor families that do not have sequentially consistent behaviour: Sparc, x86, Itanium, ARM, and Power. Early work by Collier [Col92] developed models based on empirical testing for the multiprocessors of the day. For Sparc, the vendor documentation has a clear Total Store Ordering (TSO) model [SFC91, Spa92]. It also introduces PSO and RMO models, but these are not used in practice. For x86, the vendor intentions were until recently quite unclear, as was the behaviour of processor implementations. The work by Sarkar, Owens, et al. [SSZN+09, OSS09, SSO+10] suggests that for normal user- or system-code they are also TSO. Their work is in a similar spirit to our own, with a mechanised semantics that is tested against empirical observation. Itanium provides a much weaker model than TSO, but one which is more precisely defined by the vendor than x86 [Int02]; it has also been formalised in TLA [JLM+03] and in higher-order logic [YGLS03].

For Power, there have been several previous models, but none are satisfactory for reasoning about realistic concurrent code. In part this is because the architecture has changed over time: the lwsync barrier has been added, and barriers are now cumulative. Corella, Stone and Barton [CSB93] gave an early axiomatic model for PowerPC, but, as Adir et al. note [AAS03], this model is flawed (it permits the non-SC final state of the MP+syncs example we show in §2). Stone and Fitzgerald later gave a prose description of PowerPC memory order, largely in terms of the microarchitecture of the time [SF95]. Gharachorloo [Gha95] gives a variety of models for different architectures in a general framework, but the model for the PowerPC is described as "approximate"; it is apparently based on Corella et al. [CSB93] and on May et al. [MSSW94]. Adve and Gharachorloo [AG96] make clear that PowerPC is very relaxed, but do not discuss the intricacies of dependency-induced ordering, or the more modern barriers. Adir, Attiya, and Shurek give a detailed axiomatic model [AAS03], in terms of a view order for each thread. The model was "developed through an iterative process of successive refinements, numerous discussions with the PowerPC architects, and analysis of examples and counterexamples", and its consequences for a number of litmus tests (some of which we use here) are described in detail. These facts inspire some confidence, but it is not easy to understand the force of the axioms, and it describes non-cumulative barriers, following the pre-PPC 1.09 PowerPC architecture; current processors appear to be quite different. More recently, Chong and Ishtiaq give a preliminary model for ARM [CI08], which has a very similar architected memory model to Power. In our initial work in this area [AFI+09], we gave an axiomatic model based on a reading of the Power ISA 2.05 and ARM ARM specifications, with experimental results for a few tests (described as work in progress); this seems to be correct for some aspects but to give an unusably weak semantics to barriers.

More recently, we gave a rather different axiomatic model [AMSS10], further developed in Alglave's thesis [Alg10] as an instance of a general framework; it models the non-multi-copy-atomic nature of Power (with examples such as IRIW+addrs correctly allowed) in a simple global-time setting. The axiomatic model is sound with respect to our experimental tests, and on that basis can be used for reasoning, but it is weaker than the observed behaviour or architectural intent for some important examples. Moreover, it was based principally on black-box testing and its relationship to the actual processor implementations is less clear than that for the operational model we present here, which is more firmly grounded on microarchitectural and architectural discussion. In more detail, the axiomatic model is weaker than one might want for lwsync and for cumulativity: it allows MP+lwsync+addr and ISA2+sync+data+addr, which are not observed and which are intended to be architecturally forbidden. It also forbids R01, which is not observed but architecturally intended to be allowed, and which is allowed by the model given here. The two models are thus incomparable.

We mention also Lea's JSR-133 Cookbook for Compiler Writers [Lea], which gives informal (and approximate) models for several multiprocessors, and which highlights the need for clear models.

10. Conclusion

To summarise our contribution, we have characterised the actual behaviour of Power multiprocessors, by example and by giving a semantic model. Our examples include new tests illustrating several previously undescribed phenomena, together with variations of classic tests and a large suite of automatically generated tests; we have experimentally investigated their behaviour on a range of processors. Our model is: rigorous (in machine-typechecked mathematics); experimentally validated; accessible (in an abstract machine style, and detailed here in a few pages of prose); usable (as witnessed by the explanations of examples); supported by a tool, for calculating the possible outcomes of tests; and sufficient to explain the subtle behaviour exposed by our examples and testing. It is a new abstraction, maintaining coherence and cumulativity properties by
fiat but modelling out-of-order and speculative execution explicitly.

The model should provide a good intuition for developers of concurrent systems code for Power multiprocessors, e.g. of concurrency libraries, language runtimes, OS kernels, and optimising compilers. Moreover, as the ARM architecture memory model is very similar, it may well be applicable (with minor adaptation) to ARM.

The model also opens up many directions for future research in verification theory and tools. For example, it is now possible to state results about the correct compilation of the C++0x concurrency primitives to Power processors, and to consider barrier- and dependency-aware optimisations in that context. We have focussed here primarily on the actual behaviour of implementations, but there is also work required to identify the guarantees that programmers actually rely on, which may be somewhat weaker; some of our more exotic examples are not natural use-cases, to the best of our knowledge. There is also future work required to broaden the scope of the model, which here covers only cacheable memory without mixed-size accesses.

We described our model principally in its own terms and in terms of the observed behaviour, without going into details of the relationship between the model and the underlying microarchitecture, or with the vendor architecture specification [Pow09]; this remains future work. A central notion of the memory model text in the latter is that of when a memory read or write by one thread is performed with respect to another, which has a hypothetical (or subjunctive) definition, e.g. for loads: "A load by a processor P1 is performed with respect to a processor P2 when the value to be returned can no longer be changed by a store by P2", where that P2 store may not even be present in the program under consideration (again, ARM is similar). This definition made perfect sense in the original white-box setting [DSB86], where the internal structure of the system was known and one can imagine the hypothetical store by P2 appearing at some internal interface, but in the black-box setting of a commercial multiprocessor, it is hard or impossible to make it precise, especially with examples such as PPOCA. Our abstract-machine model may provide a stepping stone towards improved architectural definitions, perhaps via new axiomatic characterisations.

Acknowledgements

We acknowledge funding from EPSRC grants EP/F036345, EP/H005633, and EP/H027351, from ANR project ParSec (ANR-06-SETIN-010), and from INRIA associated team MM.

References

[AAS03] A. Adir, H. Attiya, and G. Shurek. Information-flow models for shared memory with an application to the PowerPC architecture. IEEE Trans. Parallel Distrib. Syst., 14(5):502-515, 2003.
[AFI+09] J. Alglave, A. Fox, S. Ishtiaq, M. O. Myreen, S. Sarkar, P. Sewell, and F. Zappa Nardelli. The semantics of Power and ARM multiprocessor machine code. In Proc. DAMP 2009, January 2009.
[AG96] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66-76, 1996.
[Alg10] Jade Alglave. A Shared Memory Poetics. PhD thesis, Université Paris 7 - Denis Diderot, November 2010.
[AMSS10] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in weak memory models. In Proc. CAV, 2010.
[AMSS11] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Litmus: Running tests against hardware. In Proc. TACAS, 2011.
[ARM08] ARM. ARM Barrier Litmus Tests and Cookbook, October 2008. PRD03-GENC-007826 2.0.
[BA08] H.-J. Boehm and S. Adve. Foundations of the C++ concurrency memory model. In Proc. PLDI, 2008.
[BOS+11] M. Batty, S. Owens, S. Sarkar, P. Sewell, and T. Weber. Mathematizing C++ concurrency. In Proc. POPL, 2011.
[CI08] N. Chong and S. Ishtiaq. Reasoning about the ARM weakly consistent memory model. In MSPC, 2008.
[Col92] W. W. Collier. Reasoning about parallel architectures. Prentice-Hall, Inc., 1992.
[CSB93] F. Corella, J. M. Stone, and C. M. Barton. A formal specification of the PowerPC shared memory architecture. Technical Report RC18638, IBM, 1993.
[DSB86] M. Dubois, C. Scheurich, and F. Briggs. Memory access buffering in multiprocessors. In ISCA, 1986.
[Gha95] K. Gharachorloo. Memory consistency models for shared-memory multiprocessors. WRL Research Report, 95(9), 1995.
[Int02] Intel. A formal specification of Intel Itanium processor family memory ordering, 2002. developer.intel.com/design/itanium/downloads/251429.htm.
[JLM+03] R. Joshi, L. Lamport, J. Matthews, S. Tasiran, M. Tuttle, and Y. Yu. Checking cache-coherence protocols with TLA+. Form. Methods Syst. Des., 22:125-131, March 2003.
[KSSF10] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's next-generation server processor. IEEE Micro, 30:7-15, March 2010.
[Lea] D. Lea. The JSR-133 cookbook for compiler writers. http://gee.cs.oswego.edu/dl/jmm/cookbook.html.
[LSF+07] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. Sauer, E. M. Schwarz, and M. T. Vaden. IBM POWER6 microarchitecture. IBM Journal of Research and Development, 51(6):639-662, 2007.
[MSSW94] C. May, E. Silha, R. Simpson, and H. Warren, editors. The PowerPC architecture: a specification for a new family of RISC processors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994.
[OBZNS] S. Owens, P. Bohm, F. Zappa Nardelli, and P. Sewell. Lightweight tools for heavyweight semantics. Submitted for publication. http://www.cl.cam.ac.uk/~so294/lem/.
[OSS09] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In Proc. TPHOLs, pages 391-407, 2009.
[Pow09] Power ISA™ Version 2.06. IBM, 2009.
[SF95] J. M. Stone and R. P. Fitzgerald. Storage in the PowerPC. IEEE Micro, 15:50-58, April 1995.
[SFC91] P. S. Sindhu, J.-M. Frailong, and M. Cekleov. Formal specification of memory models. In Scalable Shared Memory Multiprocessors, pages 25-42. Kluwer, 1991.
[SKT+05] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4-5):505-522, 2005.
[Spa92] The SPARC Architecture Manual, V. 8. SPARC International, Inc., 1992. Revision SAV080SI9308. http://www.sparc.org/standards/V8.pdf.
[SSA+11] S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER multiprocessors. www.cl.cam.ac.uk/users/pes20/ppc-supplemental, 2011.
[SSO+10] P. Sewell, S. Sarkar, S. Owens, F. Zappa Nardelli, and M. O. Myreen. x86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. Communications of the ACM, 53(7):89-97, July 2010.
[SSZN+09] S. Sarkar, P. Sewell, F. Zappa Nardelli, S. Owens, T. Ridge, T. Braibant, M. Myreen, and J. Alglave. The semantics of x86-CC multiprocessor machine code. In Proc. POPL 2009, January 2009.
[YGLS03] Y. Yang, G. Gopalakrishnan, G. Lindstrom, and K. Slind. Analyzing the Intel Itanium memory ordering rules using logic programming and SAT. In Proc. CHARME, LNCS 2860, 2003.