Understanding POWER Multiprocessors

Susmit Sarkar (1), Peter Sewell (1), Jade Alglave (2,3), Luc Maranget (3), Derek Williams (4)
(1) University of Cambridge, (2) Oxford University, (3) INRIA, (4) IBM Austin

Abstract

Exploiting today's multiprocessors requires high-performance and correct concurrent systems code (optimising compilers, language runtimes, OS kernels, etc.), which in turn requires a good understanding of the observable processor behaviour that can be relied on. Unfortunately this critical hardware/software interface is not at all clear for several current multiprocessors.

In this paper we characterise the behaviour of IBM POWER multiprocessors, which have a subtle and highly relaxed memory model (ARM multiprocessors have a very similar architecture in this respect). We have conducted extensive experiments on several generations of processors: POWER G5, 5, 6, and 7. Based on these, on published details of the microarchitectures, and on discussions with IBM staff, we give an abstract-machine semantics that abstracts from most of the implementation detail but explains the behaviour of a range of subtle examples. Our semantics is explained in prose but defined in rigorous machine-processed mathematics; we also confirm that it captures the observable processor behaviour, or the architectural intent, for our examples with an executable checker. While not officially sanctioned by the vendor, we believe that this model gives a reasonable basis for reasoning about current POWER multiprocessors.

Our work should bring new clarity to concurrent systems programming for these architectures, and is a necessary precondition for any analysis or verification. It should also inform the design of languages such as C and C++, where the language memory model is constrained by what can be efficiently compiled to such multiprocessors.

Categories and Subject Descriptors: C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: Parallel processors; D.1.3 [Concurrent Programming]: Parallel programming; F.3.1 [Specifying and Verifying and Reasoning about Programs]

General Terms: Documentation, Languages, Reliability, Standardization, Theory, Verification

Keywords: Relaxed Memory Models, Semantics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'11, June 4-8, 2011, San Jose, California, USA. Copyright © 2011 ACM 978-1-4503-0663-8/11/06...$10.00

1. Introduction

Power multiprocessors (including the IBM POWER 5, 6, and 7, and various PowerPC implementations) have for many years had aggressive implementations, providing high performance but exposing a very relaxed memory model, one that requires careful use of dependencies and memory barriers to enforce ordering in concurrent code. A priori, one might expect the behaviour of a multiprocessor to be sufficiently well-defined by the vendor architecture documentation, here the Power ISA v2.06 specification [Pow09]. For the sequential behaviour of instructions, that is very often true. For concurrent code, however, the observable behaviour of Power multiprocessors is extremely subtle, as we shall see, and the guarantees given by the vendor specification are not always clear. We therefore set out to discover the actual processor behaviour and to define a rigorous and usable semantics, as a foundation for future system building and research.

The programmer-observable relaxed-memory behaviour of these multiprocessors emerges as a whole-system property from a complex microarchitecture [SKT+05, LSF+07, KSSF10]. This can change significantly between generations, e.g. from POWER 6 to POWER 7, but includes: cores that perform out-of-order and speculative execution, with many shadow registers; hierarchical store buffering, with some buffering shared between threads of a simultaneous multithreading (SMT) core, and with multiple levels of cache; store buffering partitioned by address; and a cache protocol with many cache-line states and a complex interconnection topology, and in which cache-line invalidate messages are buffered. The implementation of coherent memory and of the memory barriers involves many of these, working together. To make a useful model, it is essential to abstract from as much as possible of this complexity, both to make it simple enough to be comprehensible and because the detailed hardware designs are proprietary (the published literature does not describe the microarchitecture in enough detail to confidently predict all the observable behaviour). Of course, the model also has to be sound, allowing all behaviour that the hardware actually exhibits, and sufficiently strong, capturing any guarantees provided by the hardware that systems programmers rely on. It does not have to be tight: it may be desirable to make a loose specification, permitting some behaviour that current hardware does not exhibit, but which programmers do not rely on the absence of, for simplicity or to admit different implementations in future. The model does not have to correspond in detail to the internal structure of the hardware: we are capturing the external behaviour of reasonable implementations, not the implementations themselves. But it should have a clear abstraction relationship to implementation microarchitecture, so that the model truly explains the behaviour of examples.

To develop our model, and to establish confidence that it is sound, we have conducted extensive experiments, running several thousand tests, both hand-written and automatically generated, on several generations of processors, for up to 10^11 iterations each. We present some simple tests in §2, to introduce the relaxed behaviour allowed by Power
processors, and some more subtle examples in §6, with representative experimental data in §7. To ensure that our model explains the behaviour of tests in a way that faithfully abstracts from the actual hardware, using appropriate concepts, we depend on extensive discussions with IBM staff. To validate the model against experiment, we built a checker, based on code automatically generated from the mathematical definition, to calculate the allowed outcomes of tests (§8); this confirms that the model gives the correct results for all tests we describe and for a systematically generated family of around 300 others.

Relaxed memory models are typically expressed either in an axiomatic or an operational style. Here we adopt the latter, defining an abstract machine in §3 and §4. We expect that this will be more intuitive than typical axiomatic models, as it has a straightforward notion of global time (in traces of abstract machine transitions), and the abstraction from the actual hardware is more direct. More particularly, to explain some of the examples, it seems to be necessary to model out-of-order and speculative reads explicitly, which is easier to do in an abstract-machine setting. This work is an exercise in making a model that is as simple as possible but no simpler: the model is considerably more complex than some (e.g. for TSO processors such as Sparc and x86), but does capture the processor behaviour or architectural intent for a range of subtle examples. Moreover, while the definition is mathematically rigorous, it can be explained in only a few pages of prose, so it should be accessible to the expert systems programmers (of concurrency libraries, language runtimes, optimising compilers, etc.) who have to be concerned with these issues. We end with discussion of related work (§9) and a brief summary of future directions (§10), returning at last to the vendor architecture.

2. Simple Examples

We begin with an informal introduction to Power multiprocessor behaviour by example, introducing some key concepts but leaving explanation in terms of the model to later.

2.1 Relaxed behaviour

In the absence of memory barriers or dependencies, Power multiprocessors exhibit a very relaxed memory model, as shown by their behaviour for the following four classic tests.

SB: Store Buffering. Here two threads write to shared-memory locations and then each reads from the other location, an idiom at the heart of Dekker's mutual-exclusion algorithm, for example. In pseudocode:

  Thread 0    Thread 1
  x=1         y=1
  r1=y        r2=x

  Initial shared state: x=0 and y=0
  Allowed final state: r1=0 and r2=0

In the specified execution both threads read the value from the initial state (in later examples, this is zero unless otherwise stated). To eliminate any ambiguity about exactly what machine instructions are executed, either from source-language semantics or compilation concerns, we take the definitive version of our examples to be in PowerPC assembly (available online [SSA+11]), rather than pseudocode. The assembly code is not easy to read, however, so here we present examples as diagrams of the memory read and write events involved in the execution specified by the initial and final state constraints. In this example, the pseudocode r1 and r2 represent machine registers, so accesses to those are not memory events; with the final state as specified, the only conceivable execution has two writes, labelled a and c, and two reads, labelled b and d, with values as below. They are related by program order po (later we elide implied po edges), and the fact that the two reads both read from the initial state (0) is indicated by the incoming reads-from (rf) edges (from writes to reads that read from them); the dots indicate the initial-state writes.
Test SB: Allowed
  Thread 0: a: W[x]=1, b: R[y]=0
  Thread 1: c: W[y]=1, d: R[x]=0
  Edges: po a->b; po c->d; rf init->b; rf init->d

This example illustrates the key relaxation allowed in Sparc or x86 TSO models [Spa92, SSO+10]. The next three show some ways in which Power gives a weaker model.

MP: Message passing. Here Thread 0 writes data x and then sets a flag y, while Thread 1 reads y from that flag write and then reads x. On Power that read is not guaranteed to see the Thread 0 write of x; it might instead read from 'before' that write, despite the chain of po and rf edges:

Test MP: Allowed
  Thread 0: a: W[x]=1, b: W[y]=1
  Thread 1: c: R[y]=1, d: R[x]=0
  Edges: po a->b; rf b->c; po c->d; rf init->d

In real code, the read c of y might be in a loop, repeated until the value read is 1. Here, to simplify experimental testing, we do not have a loop but instead consider only executions in which the value read is 1, expressed with a constraint on the final register values in the test source.

WRC: Write-to-Read Causality. Here Thread 0 communicates to Thread 1 by writing x=1. Thread 1 reads that, and then later (in program order) sends a message to Thread 2 by writing into y. Having read that write of y at Thread 2, the question is whether a program-order-subsequent read of x at Thread 2 is guaranteed to see the value written by the Thread 0 write, or might read from 'before' that, as shown, again despite the rf and po chain. On Power that is possible.

Test WRC: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, c: W[y]=1
  Thread 2: d: R[y]=1, e: R[x]=0
  Edges: rf a->b; po b->c; rf c->d; po d->e; rf init->e

IRIW: Independent Reads of Independent Writes. Here two threads (0 and 2) write to distinct locations while two others (1 and 3) each read from both locations. In the specified allowed execution, they see the two writes in different orders (Thread 1's first read sees the write to x but the program-order-subsequent read does not see the write of y, whereas Thread 3 sees the write to y but not that to x).
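To make precise why the SB and MP outcomes above count as relaxed, the following minimal sketch enumerates every sequentially consistent (SC) interleaving of a test against a single shared memory and collects the reachable final register values. The tuple encoding of instructions, the register names r1 and r2 used for MP, and the helper names are assumptions made for illustration; this is not the paper's checker.

```python
def interleavings(threads):
    """Yield every interleaving of the per-thread lists that preserves program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def sc_outcomes(threads):
    """Final register states reachable under SC (all locations initially 0)."""
    outcomes = set()
    for trace in interleavings(threads):
        mem, regs = {}, {}
        for op, a, b in trace:
            if op == "W":          # ("W", location, value)
                mem[a] = b
            else:                  # ("R", register, location)
                regs[a] = mem.get(b, 0)
        outcomes.add(tuple(sorted(regs.items())))
    return outcomes


# Hypothetical encodings of the SB and MP pseudocode.
SB = [[("W", "x", 1), ("R", "r1", "y")],
      [("W", "y", 1), ("R", "r2", "x")]]
MP = [[("W", "x", 1), ("W", "y", 1)],
      [("R", "r1", "y"), ("R", "r2", "x")]]

sb_relaxed = (("r1", 0), ("r2", 0))   # both SB reads see the initial state
mp_relaxed = (("r1", 1), ("r2", 0))   # MP flag seen, data not seen

print(sb_relaxed in sc_outcomes(SB))  # False
print(mp_relaxed in sc_outcomes(MP))  # False
```

Neither outcome is reachable in any interleaving, which is what it means for the executions allowed on Power to be non-SC.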
Test IRIW: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, c: R[y]=0
  Thread 2: d: W[y]=1
  Thread 3: e: R[y]=1, f: R[x]=0
  Edges: rf a->b; po b->c; rf d->e; po e->f; rf init->c; rf init->f

Coherence. Despite all the above, one does get a guarantee of coherence: in any execution, for each location, there is a single linear order (co) of all writes (by any processor) to that location, which must be respected by all threads. The four cases below illustrate this: a pair of reads by a thread cannot read contrary to the coherence order (CoRR1); the coherence order must respect program order for a pair of writes by a thread (CoWW); a read cannot read from a write that is coherence-hidden by another write program-order-preceding the read (CoWR), and a write cannot coherence-order-precede a write that a program-order-preceding read read from (CoRW). We can now clarify the 'before' in the MP and WRC discussion above, which was with respect to the coherence order for x.

Test CoRR1: Forbidden
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, c: R[x]=0
  Edges: rf a->b; po b->c; rf init->c

Test CoWW: Forbidden
  Thread 0: a: W[x]=1, b: W[x]=2
  Edges: po a->b; co b->a

Test CoWR: Forbidden
  Thread 0: a: W[x]=1, b: R[x]=2
  Thread 1: c: W[x]=2
  Edges: po a->b; co c->a; rf c->b

Test CoRW: Forbidden
  Thread 0: a: R[x]=2, b: W[x]=1
  Thread 1: c: W[x]=2
  Edges: po a->b; co b->c; rf c->a

2.2 Enforcing ordering

The Power ISA provides several ways to enforce stronger ordering. Here we deal with the sync (heavyweight sync, or hwsync) and lwsync (lightweight sync) barrier instructions, and with dependencies and the isync instruction, leaving load-reserve/store-conditional pairs and eieio to future work.

Regaining sequential consistency (SC) using sync. If one adds a sync between every program-order pair of instructions (creating tests SB+syncs, MP+syncs, WRC+syncs, and IRIW+syncs), then all the non-SC results above are forbidden, e.g.
Test MP+syncs: Forbidden
  Thread 0: a: W[x]=1, b: W[y]=1
  Thread 1: c: R[y]=1, d: R[x]=0
  Edges: sync a->b; rf b->c; sync c->d; rf init->d

Using dependencies. Barriers can incur a significant runtime cost, and in some cases enough ordering is guaranteed simply by the existence of a dependency from a memory read to another memory access. There are various kinds:

- There is an address dependency (addr) from a read to a program-order-later memory read or write if there is a dataflow path from the read, through registers and arithmetic/logical operations (but not through other memory accesses), to the address of the second read or write.
- There is a data dependency (data) from a read to a memory write if there is such a path to the value written. Address and data dependencies behave similarly.
- There is a control dependency (ctrl) from a read to a memory write if there is such a dataflow path to the test of a conditional branch that is a program-order-predecessor of the write. We also refer to control dependencies from a read to a read, but ordering of the reads in that case is not respected in general.
- There is a control+isync dependency (ctrlisync) from a read to another memory read if there is such a dataflow path from the first read to the test of a conditional branch that program-order-precedes an isync instruction before the second read.

Sometimes one can use dependencies that are naturally present in an algorithm, but it can be desirable to introduce one artificially, for its ordering properties, e.g. by XOR'ing a value with itself and adding that to an address calculation.

Dependencies alone are usually not enough. For example, adding dependencies between read/read and read/write pairs, giving tests WRC+data+addr (with a data dependency on Thread 1 and an address dependency on Thread 2), and IRIW+addrs (with address dependencies on Threads 1 and 3), leaves the non-SC behaviour allowed. One cannot add dependencies to SB, as that only has write/read pairs, and one can only add a dependency to the read/read side of MP, leaving the writes unconstrained and the non-SC behaviour still allowed.

In combination with a barrier, however, dependencies can be very useful. For example, MP+sync+addr is SC:

Test MP+sync+addr: Forbidden
  Thread 0: a: W[x]=1, b: W[y]=1
  Thread 1: c: R[y]=1, d: R[x]=0
  Edges: sync a->b; rf b->c; addr c->d; rf init->d
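The four dependency kinds defined above can be computed by tracking, for each register, which memory reads have a dataflow path into it. The sketch below is a rough illustration under stated assumptions: it handles only straight-line code, and the dictionary-based instruction encoding, the register names (rx, ry, r1, r2, r3) and the function name are all hypothetical, not taken from the paper or from any real tool.

```python
def dependencies(program):
    """Return a set of (kind, source_read_index, target_index) edges."""
    taint = {}         # register -> indices of reads whose value flows into it
    ctrl = set()       # reads feeding the test of an earlier conditional branch
    ctrlisync = set()  # the same, where an isync also follows that branch
    edges = set()

    def flow(regs):
        # Union of the reads that reach any of the given registers.
        return set().union(*(taint.get(r, set()) for r in regs))

    for i, ins in enumerate(program):
        op = ins["op"]
        if op in ("load", "store"):
            edges |= {("addr", r, i) for r in flow(ins["addr"])}
            edges |= {("ctrl", r, i) for r in ctrl}
        if op == "load":
            edges |= {("ctrlisync", r, i) for r in ctrlisync}
            # Paths do not pass through other memory accesses, so the
            # destination register depends on this read alone.
            taint[ins["dst"]] = {i}
        elif op == "store":
            edges |= {("data", r, i) for r in flow(ins["data"])}
        elif op == "alu":
            taint[ins["dst"]] = flow(ins["src"])
        elif op == "branch":
            ctrl |= flow(ins["src"])
        elif op == "isync":
            ctrlisync |= ctrl
    return edges


# Thread 1 of MP+sync+addr, with an artificial address dependency:
# r1 = [y]; r3 = r1 XOR r1; r2 = [x + r3]
thread1 = [
    {"op": "load", "dst": "r1", "addr": ["ry"]},
    {"op": "alu", "dst": "r3", "src": ["r1", "r1"]},
    {"op": "load", "dst": "r2", "addr": ["rx", "r3"]},
]
print(dependencies(thread1))  # {('addr', 0, 2)}
```

Although r3 is always zero, the dataflow path from the first read to the address of the second still exists, which is why the XOR idiom yields an addr edge.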
In MP+sync+addr the barrier keeps the writes in order, as seen by any thread, and the address dependency keeps the reads in order. Contrary to what one might expect, the combination of a thread-local reads-from edge and a dependency does not guarantee ordering of a write-write pair, as seen by another thread; the two writes can propagate in either order (here [x]=z initially):

Test MP+nondep+sync: Allowed
  Thread 0: a: W[x]=y, b: R[x]=y, c: W[y]=1
  Thread 1: d: R[y]=1, e: R[x]=z
  Edges: rf a->b; addr b->c; rf c->d; sync d->e; rf init->e

Control dependencies, observable speculative reads, and isync. Recall that control dependencies (without isync) are only respected from reads to writes, not from reads to reads. If one replaces the address dependency in MP+sync+addr by a dataflow path to a conditional branch before the second read (giving the test named MP+sync+ctrl below), that does not ensure that the reads on Thread 1 bind their values in program order.

Test MP+sync+ctrl: Allowed
  Thread 0: a: W[x]=1, b: W[y]=1
  Thread 1: c: R[y]=1, d: R[x]=0
  Edges: sync a->b; rf b->c; ctrl c->d; rf init->d

Adding an isync instruction between the branch and the second read (giving test MP+sync+ctrlisync) suffices. The fact that data/address dependencies to both reads and writes are respected while control dependencies are only respected to writes is important in the design of C++0x low-level atomics [BA08, BOS+11], where release/consume atomics let one take advantage of data dependencies without requiring barriers (and limiting optimisation) to ensure that all source-language control dependencies are respected.

Cumulativity. For WRC it suffices to have a sync on Thread 1 with a dependency on Thread 2; the non-SC behaviour is then forbidden:

Test WRC+sync+addr: Forbidden
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, c: W[y]=1
  Thread 2: d: R[y]=1, e: R[x]=0
  Edges: rf a->b; sync b->c; rf c->d; addr d->e; rf init->e

This illustrates what we call A-cumulativity of Power barriers: a chain of edges before the barrier that is respected. In this case Thread 1 reads from the Thread 0 write before (in program order) executing a sync, and then Thread 1 writes to another location; any other thread (here 2) is guaranteed to see the Thread 0 write before the Thread 1 write. However, swapping the sync and dependency, e.g. with just an rf and data edge between writes a and c, does not guarantee ordering of those two writes as seen by another thread:

Test WRC+data+sync: Allowed
  Thread 0: a: W[x]=1
  Thread 1: b: R[x]=1, c: W[y]=1
  Thread 2: d: R[y]=1, e: R[x]=0
  Edges: rf a->b; data b->c; rf c->d; sync d->e; rf init->e

In contrast to that WRC+data+sync, a chain of reads-from edges and dependencies after a sync does ensure that ordering between a write before the sync and a write after the sync is respected, as below. Here the reads e and f of z and x cannot see the writes a and d out of order. We call this a B-cumulativity property.

Test ISA2+sync+data+addr: Forbidden
  Thread 0: a: W[x]=1, b: W[y]=1
  Thread 1: c: R[y]=1, d: W[z]=1
  Thread 2: e: R[z]=1, f: R[x]=0
  Edges: sync a->b; rf b->c; data c->d; rf d->e; addr e->f; rf init->f

Using lwsync. The lwsync barrier is broadly similar to sync, including cumulativity properties, except that it does not order store/load pairs and it is cheaper to execute; it suffices to guarantee SC behaviour in MP+lwsyncs (MP with lwsync in each thread), WRC+lwsync+addr (WRC with lwsync on Thread 1 and an address dependency on Thread 2), and ISA2+lwsync+data+addr, while SB+lwsyncs and IRIW+lwsyncs are still allowed. We return later to other differences between sync and lwsync.

3. The Model Design

We describe the high-level design of our model in this section, giving the details in the next. We build our model as a composition of a set of (hardware) threads and a single storage subsystem, synchronising on various messages:

[Figure: threads and a single storage subsystem, exchanging write requests, read requests and barrier requests.]
Read-request/read-response pairs are tightly coupled, while the others are single unidirectional messages. There is no buffering between the two parts.

Coherence-by-fiat. Our storage subsystem abstracts completely from the processor implementation store-buffering and cache hierarchy, and from the cache protocol: our model has no explicit memory, either of the system as a whole, or of any cache or store queue (the fact that one can abstract from all these is itself interesting). Instead, we work in terms of the write events that a read can read from. Our storage subsystem maintains, for each address, the current constraint on the coherence order among the writes it has seen to that address, as a strict partial order (transitive but irreflexive). For example, suppose the storage subsystem has seen four writes, w0, w1, w2 and w3, all to the same address. It might have built up the coherence constraint on the left below, with w0 known to be before w1, w2 and w3, and w1 known to be before w2, but with as-yet-undetermined relationships between w1 and w3, and between w2 and w3.

[Figure: two coherence constraints over the writes w0, w1, w2 and w3. Left: w0 before w1, w2 and w3, and w1 before w2. Right: as on the left, with w1 also before w3.]

The storage subsystem also records the list of writes that it has propagated to each thread: those sent in response to read-requests, those done by the thread itself, and those propagated to that thread in the process of propagating a barrier to that thread. These are interleaved with records of barriers propagated to that thread. Note that this is a storage-subsystem-model concept: the writes propagated to a thread have not necessarily been sent to the thread model in a read-response.

Now, given a read request by a thread tid, what writes could be sent in response? From the state on the left above, if the writes propagated to thread tid are just [w1], perhaps because tid has read from w1, then:

- it cannot be sent w0, as w0 is coherence-before the w1 write that (because it is in the writes-propagated list) it might have read from;
- it could re-read from w1, leaving the coherence constraint unchanged;
- it could be sent w2, again leaving the coherence constraint unchanged, in which case w2 must be appended to the events propagated to tid; or
- it could be sent w3, again appending this to the events propagated to tid, which moreover entails committing to w3 being coherence-after w1, as in the coherence constraint on the right above. Note that this still leaves the relative order of w2 and w3 unconstrained, so another thread could be sent w2 then w3 or (in a different run) the other way around (or indeed just one, or neither).

In the model this behaviour is split up into separate storage-subsystem transitions: there is one rule for making a new coherence commitment between two hitherto-unrelated writes to the same address, one rule for propagating a write to a new thread (which can only fire after sufficient coherence commitments have been made), and one rule for returning a read value to a thread in response to a read-request. The last always returns the most recent write (to the read address) in the list of events propagated to the reading thread, which therefore serves essentially as a per-thread memory (though it records more information than just an array of bytes). We adopt these separate transitions (in what we term a partial coherence commitment style) to make it easy to relate model transitions to actual hardware implementation events: coherence commitments correspond to writes flowing through join points in a hierarchical-store-buffer implementation.

Out-of-order and Speculative Execution. As we shall see in §6, many of the observable subtleties of Power behaviour arise from the fact that the threads can perform read instructions out-of-order and speculatively, subject to an instruction being restarted if a conflicting write comes in before the instruction is committed, or being aborted if a branch turns out to be mispredicted. However, writes are not sent to the storage subsystem before their instructions are committed, and we do not see observable value speculation. Accordingly, our thread model permits very liberal out-of-order execution, with unbounded speculative execution past as-yet-unresolved branches and past some barriers, while our storage subsystem model need not be concerned with speculation, retry, or aborts. On the other hand, the storage subsystem maintains the current coherence constraint, as above, while the thread model does not need to have access to this; the thread model plays its part in maintaining coherence by issuing requests in reasonable orders, and in aborting/retrying as necessary.

For each thread we have a tree of the committed and in-flight instruction instances. Newly fetched instructions become in-flight, and later, subject to appropriate preconditions, can be committed. For example, below we show a set of instruction instances i1, ..., i13 with the program-order-successor relation among them. Three of those (i1, i3, i4, boxed) have been committed; the remainder are in-flight.

[Figure: tree of instruction instances i1 to i13 under the program-order-successor relation, with i1, i3 and i4 boxed as committed.]

Instruction instances i5 and i9 are branches for which the thread has fetched multiple possible successors; here just two, but for a branch with a computed address it might fetch many possible successors. Note that the committed instances are not necessarily contiguous: here i3 and i4 have been committed even though i2 has not, which can only happen if they are sufficiently independent. When a branch is committed then any un-taken alternative paths are discarded, and instructions that follow (in program order) an uncommitted branch cannot be committed until that branch is, so the tree must be linear before any committed (boxed) instructions.

In implementations, reads are retried when cache-line invalidates are processed. In the model, to abstract from exactly when this happens (and from whatever tracking the core does of which instructions must be retried when it does), we adopt a late invalidate semantics, retrying appropriate reads (and their dependencies) when a read or write is committed. For example, consider two reads r1 and r2 in program order that have been satisfied from two different writes to the same address, with r2 satisfied first (out-of-order), from w1, and r1 satisfied later from the coherence-later w2. When r1 is committed, r2 must be restarted, otherwise there would be a coherence violation. (This relies on the fact that writes are never provided by the storage subsystem out of coherence order; the thread model does not need to record the coherence order explicitly.)

Dependencies are dealt with entirely by the thread model, in terms of the registers read and written by each instruction instance (the register footprints of instructions are known statically). Memory reads cannot be satisfied until their addresses are determined (though perhaps still subject to change on retry), and memory writes cannot be committed until their addresses and values are fully determined. We do not model register renaming and shadow registers explicitly, but our out-of-order execution model includes their effect, as register reads take values from program-order-preceding register writes.

Barriers (sync and lwsync) and cumulativity-by-fiat. The semantics of barriers involves both parts of the model, as follows.

When the storage subsystem receives a barrier request, it records the barrier as propagated to its own thread, marking a point in the sequence of writes that have been propagated to that thread. Those writes are the Group A writes for this barrier. When all the Group A writes (or some coherence-successors thereof) of a barrier have been propagated to another thread, the storage subsystem can record that fact also, propagating the barrier to that thread (thereby marking a point in the sequence of writes that have been propagated to that thread). A write cannot be propagated to a thread tid until all relevant barriers are propagated to tid, where the relevant barriers are those that were propagated to the writing thread before the write itself. In turn (by the above), that means that the Group A writes of those barriers (or some coherence successors) must already have been propagated to tid. This models the effect of cumulativity while abstracting from the details of how it is implemented. Moreover, a sync barrier can be acknowledged back to the originating thread when all of its Group A writes have been propagated to all threads.

In the thread model, barriers constrain the commit order. For example, no memory load or store instruction can be committed until all previous sync barriers are committed and acknowledged; and sync and lwsync barriers cannot be committed until all previous memory reads and writes have been. Moreover, memory reads cannot be satisfied until previous sync barriers are committed and acknowledged. There are various possible modelling choices here which should not make any observable difference; the above corresponds to a moderately aggressive implementation.

4. The Model in Detail

We now detail the interface between the storage subsystem and thread models, and the states and transitions of each. The transitions are described in §4.3 and §4.5 in terms of
their precondition, their effect on the relevant state, and the messages sent or received. Transitions are atomic, and synchronise as shown in Fig. 1; messages are not buffered. This is a prose description of our mathematical definitions, available on-line [SSA+11].

4.1 The Storage Subsystem/Thread Interface

The storage subsystem and threads exchange messages:

- a write request (or write) w specifies the writing thread tid, unique request id eid, address a, and value v.
- a read request specifies the originating thread tid, request id eid, and address a.
- a read response specifies the originating thread tid, request id eid, and a write w (itself specifying the thread tid' that did the write, its id eid', address a, and value v). This is sent when the value read is bound.
- a barrier request specifies the originating thread tid, request id eid, and barrier type b (sync or lwsync).
- a barrier ack specifies the originating thread tid and request id eid (a barrier ack is only sent for sync barriers, after the barrier is propagated to all threads).

4.2 Storage Subsystem States

A storage subsystem state s has the following components.

- s.threads is the set of thread ids that exist in the system.
- s.writes_seen is the set of all writes that the storage subsystem has seen.
- s.coherence is the current constraint on the coherence order, among the writes that the storage subsystem has seen. It is a binary relation: s.coherence contains the pair (w1, w2) if the storage subsystem has committed to write w1 being before write w2 in the coherence order. This relation grows over time, with new pairs being added, as the storage subsystem makes additional commitments. For each address, s.coherence is a strict partial order over the write requests seen to that address. It does not relate writes to different addresses, or relate any write that has not been seen by the storage subsystem to any write.
- s.events_propagated_to gives, for each thread, a list of:
  1. all writes done by that thread itself,
  2. all writes by other threads that have been propagated to this thread, and
  3. all barriers that have been propagated to this thread.
  We refer to those writes as the writes that have been propagated to that thread. The Group A writes for a barrier are all the writes that have been propagated to the barrier's thread before the barrier is.
- s.unacknowledged_sync_requests is the set of sync barrier requests that the storage subsystem has not yet acknowledged.

An initial state for the storage subsystem has the set of thread ids that exist in the system, exactly one write for each memory address, all of which have been propagated to all threads (this ensures that they will be coherence-before any other write to that address), an empty coherence order, and no unacknowledged sync requests.

4.3 Storage Subsystem Transitions

Accept write request. A write request by a thread tid can always be accepted. Action:
1. add the new write to s.writes_seen, to record the new write as seen by the storage subsystem;
2. append the new write to s.events_propagated_to(tid), to record the new write as propagated to its own thread; and
3. update s.coherence to note that the new write is coherence-after all writes (to the same address) that have previously propagated to this thread.

Partial coherence commitment. The storage subsystem can internally commit to a more constrained coherence order for a particular address, adding an arbitrary edge (between a pair of writes to that address that have been seen already that are not yet related by coherence) to s.coherence, together with any edges implied by transitivity, if there is no cycle in the union of the resulting coherence order and the set of all pairs of writes (w1, w2), to any address, for which w1 and w2 are separated by a barrier in the list of events propagated to the thread of w2. Action: add the new edges to s.coherence.

Propagate write to another thread. The storage subsystem can propagate a write w (by thread tid) that it has seen to another thread tid', if:
1. the write has not yet been propagated to tid';
2. w is coherence-after any write to the same address that has already been propagated to tid'; and
3. all barriers that were propagated to tid before w (in s.events_propagated_to(tid)) have already been propagated to tid'.
Action: append w to s.events_propagated_to(tid').

Send a read response to a thread. The storage subsystem can accept a read-request by a thread tid at any time, and reply with the most recent write w to the same address that has been propagated to tid. The request and response are tightly coupled into one atomic transition. Action: send a read-response message containing w to tid.

Accept barrier request. A barrier request from a thread tid can always be accepted. Action:
1. append it to s.events_propagated_to(tid), to record the barrier as propagated to its own thread (and thereby fix the set of Group A writes for this barrier); and
2. (for sync) add it to s.unacknowledged_sync_requests.

Propagate barrier to another thread. The storage subsystem can propagate a barrier it has seen to another thread tid if:
1. the barrier has not yet been propagated to that thread; and
2. for each Group A write, that write (or some coherence successor) has already been propagated to that thread.
Action: append the barrier to s.events_propagated_to(tid).

Acknowledge sync barrier. A sync barrier b can be acknowledged if it has been propagated to all threads. Action:
1. send a barrier-ack message to the originating thread; and
2. remove b from s.unacknowledged_sync_requests.
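The storage subsystem state and most of its transitions can be sketched in a few dozen lines. The following is a minimal illustration, not the authors' machine-processed definition: the class and method names are assumptions that mirror the prose, transitions return False when their precondition fails, and the partial coherence commitment rule (with its acyclicity check) is omitted. The small scenario at the end plays out the Thread 0 side of MP+syncs.

```python
from collections import namedtuple

Write = namedtuple("Write", "tid eid addr value")
Barrier = namedtuple("Barrier", "tid eid kind")   # kind: "sync" or "lwsync"


class StorageSubsystem:
    def __init__(self, threads, initial_writes):
        self.threads = set(threads)
        self.writes_seen = set(initial_writes)
        self.coherence = set()                     # pairs (w1, w2): w1 before w2
        self.events_propagated_to = {t: list(initial_writes) for t in threads}
        self.unacknowledged_sync_requests = set()

    def _writes_to(self, tid, addr):
        return [e for e in self.events_propagated_to[tid]
                if isinstance(e, Write) and e.addr == addr]

    def accept_write(self, w):
        self.writes_seen.add(w)
        for p in self._writes_to(w.tid, w.addr):
            # w is new, so adding p's predecessors keeps the relation transitive
            self.coherence |= {(q, w) for (q, r) in self.coherence if r == p}
            self.coherence.add((p, w))
        self.events_propagated_to[w.tid].append(w)

    def propagate_write(self, w, tid):
        own = self.events_propagated_to[w.tid]
        barriers = [e for e in own[:own.index(w)] if isinstance(e, Barrier)]
        seen = self.events_propagated_to[tid]
        ok = (w not in seen
              and all((p, w) in self.coherence for p in self._writes_to(tid, w.addr))
              and all(b in seen for b in barriers))
        if ok:
            seen.append(w)
        return ok

    def read_response(self, tid, addr):
        return self._writes_to(tid, addr)[-1]      # most recent write propagated

    def accept_barrier(self, b):
        self.events_propagated_to[b.tid].append(b)
        if b.kind == "sync":
            self.unacknowledged_sync_requests.add(b)

    def propagate_barrier(self, b, tid):
        own = self.events_propagated_to[b.tid]
        group_a = [e for e in own[:own.index(b)] if isinstance(e, Write)]
        seen = self.events_propagated_to[tid]
        ok = b not in seen and all(
            any(s == w or (w, s) in self.coherence for s in seen) for w in group_a)
        if ok:
            seen.append(b)
        return ok

    def acknowledge_sync(self, b):
        ok = (b in self.unacknowledged_sync_requests
              and all(b in self.events_propagated_to[t] for t in self.threads))
        if ok:
            self.unacknowledged_sync_requests.discard(b)
        return ok


# Thread 0 of MP+syncs: a: W[x]=1; sync; b: W[y]=1 (eids are arbitrary).
ix, iy = Write("init", 0, "x", 0), Write("init", 1, "y", 0)
s = StorageSubsystem([0, 1], [ix, iy])
a, b = Write(0, 2, "x", 1), Write(0, 4, "y", 1)
sync = Barrier(0, 3, "sync")
s.accept_write(a); s.accept_barrier(sync); s.accept_write(b)
print(s.propagate_write(b, 1))         # False: the sync has not reached thread 1
print(s.propagate_barrier(sync, 1))    # False: Group A write a has not reached thread 1
print(s.propagate_write(a, 1))         # True
print(s.propagate_barrier(sync, 1))    # True
print(s.propagate_write(b, 1))         # True
print(s.read_response(1, "x").value)   # 1
```

The two refused transitions show cumulativity at work: the write of y cannot reach Thread 1 until the sync has, and the sync cannot until the Group A write of x has.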
StorageSubsystemRule Message(s) ThreadRule Acceptwriterequest writerequest Commitin-\rightinstruction Partialcoherencecommitment Propagatewritetoanotherthread Sendareadresponsetoathread readrequest/readresponse Satisfymemoryreadfromstoragesubsystem Satisfymemoryreadbyforwardinganin-\rightwrite Acceptbarrierrequest barrierrequest Commitin-\rightinstruction Propagatebarriertoanotherthread Acknowledgesyncbarrier barrierack Acceptsyncbarrieracknowledgement Fetchinstruction Registerreadfrompreviousregisterwrite Registerreadfrominitialregisterstate Internalcomputationstep Figure1.StorageSubsystemandThreadSynchronisation4.4ThreadStatesThestatetofasinglehardwarethreadconsistsof:itsthreadid.theinitialvaluesforallregisters,t:initial register state.asett:committed instructionsofcommittedinstructioninstances.Alltheiroperationshavebeenexecutedandtheyarenotsubjecttorestartorabort.asett:in \right instructionsofin-\rightinstructionin-stances.Thesehavebeenfetchedandsomeoftheasso-ciatedinstruction-semanticsmicro-operationsmayhavebeenexecuted.However,noneoftheassociatedwritesorbarriershavebeensenttothestoragesubsystem,andanyin-\rightinstructionissubjecttobeingaborted(togetherwithallofitsdependents).asett:unacknowledged 
syncs of sync barriers that have not been acknowledged by the storage subsystem. An initial state for a thread has no committed or in-flight instructions and no unacknowledged sync barriers.

Each instruction instance i consists of a unique id, a representation of the current state of its instruction semantics, the names of its input and output registers, the set of writes that it has read from, the instruction address, the program-order-previous instruction instance id, and a value constraint required to reach this instruction instance from the previous instance. The instruction semantics executes in steps, doing internal computation, register reads and writes, memory reads, and, finally, memory writes or barriers.

4.5 Thread Transitions

Fetch instruction. An instruction inst can be fetched, following its program-order predecessor iprev and from address a, if
1. a is a possible next fetch address for iprev; and
2. inst is the instruction of the program at a.
The possible next fetch addresses allow speculation past calculated jumps and conditional branches; they are defined as:
1. for a non-branch/jump instruction, the successor instruction address;
2. for a jump to a constant address, that address;
3. for a jump to an address which is not yet fully determined (i.e., where there is an uncommitted instruction with a dataflow path to the address), any address; and
4. for a conditional branch, the possible addresses for a jump together with the successor.
Action: construct an initialized instruction instance and add it to the set of in-flight instructions. This is an internal action of the thread, not involving the storage subsystem, as we assume a fixed program rather than modelling fetches with reads; we do not model self-modifying code.

Commit in-flight instruction. An in-flight instruction can be committed if:
1. its instruction semantics has no pending reads (memory or register) or internal computation (data or address arithmetic);
2. all instructions with a dataflow dependency to this instruction (instructions with register outputs feeding to this instruction's register inputs) are committed;
3. all program-order-previous branches are committed;
4. if a memory load or store is involved, all program-order-previous instructions which might access its address (i.e., which have an as-yet-undetermined address or which have a determined address which equals that one) are committed;
5. if a memory load or store is involved, or this instruction is a sync, lwsync, or isync, then (a) all previous sync, lwsync and isync instructions are committed, and (b) there is no unacknowledged sync barrier by this thread;
6. if a sync or lwsync instruction, all previous memory access instructions are committed;
7. if an isync, then all program-order-previous instructions which access memory have their addresses fully determined, where by "fully determined" we mean that all instructions that are the source of incoming dataflow dependencies to the relevant address are committed and any internal address computation is done.
Action: note that the instruction is now committed, and:
1. if a write instruction, restart any in-flight memory reads (and their dataflow dependents) that have read from the same address, but from a different write (and where the read could not have been by forwarding an in-flight write);
2. if a read instruction, find all in-flight program-order successors that have read from a different write to the same address, or which follow a lwsync barrier program-order after this instruction, and restart them and their dataflow dependents;
3. if this is a branch, abort any untaken speculative paths of execution, i.e., any instruction instances that are not reachable by the branch taken; and
4. send any write requests or barrier requests as required by the instruction semantics.

Accept sync barrier acknowledgement. A sync barrier acknowledgement can always be accepted (there will always be a committed sync whose barrier has a matching eid). Action: remove the corresponding barrier from t.unacknowledged
syncs.

Satisfy memory read from storage subsystem. A pending read request in the instruction semantics of an in-flight instruction can be satisfied by making a read-request and getting a read-response containing a write from the storage subsystem if:
1. the address to read is determined (i.e., any other reads with a dataflow path to the address have been satisfied, though not necessarily committed, and any arithmetic on such a path completed);
2. all program-order-previous syncs are committed and acknowledged; and
3. all program-order-previous isyncs are committed.
Action:
1. update the internal state of the reading instruction; and
2. note that the write has been read from by that instruction.

The remaining transitions are all thread-internal steps.

Satisfy memory read by forwarding an in-flight write directly to reading instruction. A pending memory write w from an in-flight (uncommitted) instruction can be forwarded directly to a read of an instruction i if
1. w is an uncommitted write to the same address that is program-order before the read, and there is no program-order-intervening memory write that might be to the same address;
2. all i's program-order-previous syncs are committed and acknowledged; and
3. all i's program-order-previous isyncs are committed.
Action: as in the satisfy memory read from storage subsystem rule above.

Register read from previous register write. A register read can read from a program-order-previous register write if the latter is the last write to the same register program-order before it. Action: update the internal state of the in-flight reading instruction.

Register read from initial register state. A register read can read from the initial register state if there is no write to the same register program-order before it. Action: update the internal state of the in-flight reading instruction.

Internal computation step. An in-flight instruction can perform an internal computation step if its semantics has a pending internal transition, e.g. for an arithmetic operation. Action: update the internal state of the in-flight instruction.

4.6 Final states

The final states are those with no transitions. It should be the case that for all such, all instruction instances are committed.

5. Explaining the simple examples

The abstract machine explains the allowed and forbidden behaviour for all the simple tests we saw before. For example, in outline:

MP. The Thread 0 write-requests for x and y could be in-order or not, but either way, because they are to different addresses, they can be propagated to Thread 1 in either order. Moreover, even if they are propagated in program order, the Thread 1 read of x can be satisfied first (seeing the initial state), then the read of y, and they could be committed in either order.

MP+sync+ctrl (control dependency). Here the sync keeps the propagation of the writes to Thread 1 in order, but the Thread 1 read of x can be satisfied speculatively, before the conditional branch of the control dependency is resolved and before the program-order-preceding Thread 1 read of y is satisfied; then the two reads can be committed in program order.

MP+sync+ctrlisync (isync). Adding isync between the conditional branch and the Thread 1 read of x prevents that read being satisfied until the isync is committed, which cannot happen until the program-order-previous branch is committed, which cannot happen until the first read is satisfied and committed.

WRC+sync+addr (A-cumulativity). The Thread 0 write-request for a: W[x]=1 must be made, and the write propagated to Thread 1, for b to read 1 from it. Thread 1 then makes a barrier request for its sync, and that is propagated to Thread 1 after a (so the write a is in the Group A set for this barrier), before making the write-request for c: W[y]=1. That write must be propagated to Thread 2 for d to read from it, but before that is possible the sync must be propagated to Thread 2, and before that is possible a must be propagated to Thread 2. Meanwhile, the dependency on Thread 2 means that the address of the read e is not known, and so e cannot be satisfied, until read d has been satisfied (from c). As that cannot be until after a is propagated to Thread 2, read e must read 1 from a, not 0 from the initial state.

WRC+data+sync. Here, in contrast, while the Thread 0/Thread 1 reads-from relationship and the Thread 1 dependency ensure that the write-requests for a: W[x]=1 and c: W[y]=1 are made in that order, and the Thread 2 sync keeps its reads in order, the order that the writes are propagated to Thread 2 is unconstrained.

ISA2 (B-cumulativity). In the ISA2+sync+data+addr B-cumulativity example, the Thread 0 write requests and barrier request must reach the storage subsystem in program order, so Group A for the sync is {a} and the sync is propagated to Thread 0 before the b write request reaches the storage subsystem. For c to read from b, the latter must have been propagated to Thread 1, which requires the sync to be propagated to Thread 1, which in turn requires the Group A write a to have been propagated to Thread 1. Now, the Thread 1 dependency means that d cannot be committed before the read c is satisfied (and committed), and hence d must be after the sync is propagated to Thread 1. Finally, for e to read from d, the latter must have been propagated to Thread 2, for which the sync must be propagated to Thread 2, and hence the Group A write a propagated to Thread 2. The Thread 2 dependency means that f cannot be satisfied until e is, so it must read from a, not from the initial state. The same result and reasoning hold for the lwsync variant of this test (note that the reasoning did not involve sync acks or any memory reads program-order-after the sync).

IRIW+syncs. Here the two syncs (on Threads 1 and 3) have the corresponding writes (a and d) in their Group A sets, and hence those writes must be propagated to all threads before the respective syncs are acknowledged, which must happen before the program-order-subsequent reads c and f can be satisfied. But for those to read 0, from coherence-predecessors of a and d, the latter must not have been propagated to all threads (in particular, they must not have been propagated to Threads 3 and 1 respectively). In other words, for this to happen there would have to be a cycle in abstract-machine execution time:

  a: W[x]=1 propagated to last thread
  -> Thread 1 sync acknowledgement
  -> Thread 1 c: R[y]=0 is satisfied
  -> d: W[y]=1 propagated to last thread
  -> Thread 3 sync acknowledgement
  -> Thread 3 f: R[x]=0 is satisfied
  -> a: W[x]=1 propagated to last thread
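The IRIW+syncs argument above is purely an ordering argument: each constraint says that one abstract-machine event must precede another, and the outcome is forbidden because the constraints cannot all hold in a single execution. The following Python sketch checks this mechanically. The event names and the `schedule` helper are illustrative assumptions and are not part of the Lem model or of the tools described in this paper.

```python
from collections import deque

def schedule(edges):
    """Return the events in an order that respects every (before, after) edge,
    or None if the edges contain a cycle (Kahn's algorithm)."""
    succs, indegree = {}, {}
    for before, after in edges:
        succs.setdefault(before, []).append(after)
        succs.setdefault(after, [])
        indegree.setdefault(before, 0)
        indegree[after] = indegree.get(after, 0) + 1
    ready = deque(e for e, n in indegree.items() if n == 0)
    order = []
    while ready:
        event = ready.popleft()
        order.append(event)
        for nxt in succs[event]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order if len(order) == len(indegree) else None

# Hypothetical event labels for the IRIW+syncs outcome discussed in the text.
A_LAST = "a: W[x]=1 propagated to last thread"
D_LAST = "d: W[y]=1 propagated to last thread"
T1_ACK = "Thread 1 sync acknowledgement"
T3_ACK = "Thread 3 sync acknowledgement"
T1_READ = "Thread 1 R[y]=0 satisfied"
T3_READ = "Thread 3 R[x]=0 satisfied"

# A read of 0 must be satisfied before the corresponding write reaches every thread.
READ_EDGES = [(T1_READ, D_LAST), (T3_READ, A_LAST)]
# With sync: Group A writes reach all threads before the ack, and the ack precedes later reads.
SYNC_EDGES = [(A_LAST, T1_ACK), (T1_ACK, T1_READ), (D_LAST, T3_ACK), (T3_ACK, T3_READ)]

print(schedule(READ_EDGES + SYNC_EDGES))  # None: the constraints are cyclic
print(schedule(READ_EDGES) is not None)   # True: without the ack edges an order exists
```

The second call drops the acknowledgement edges, which is the situation for a barrier that has no acknowledgement; the remaining constraints are then satisfiable.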
With lwsyncs instead of syncs, the behaviour is allowed, because lwsyncs do not have an analogous acknowledgement when their Group A writes have been propagated to all threads, and memory reads do not wait for previous lwsyncs to reach that point.

6. Not-so-simple examples

We now discuss some more subtle behaviours, explaining each in terms of our model.

Write forwarding. In the PPOCA variant of MP below, f is address-dependent on e, which reads from the write d, which is control-dependent on c. One might expect that chain to prevent read f binding its value (with the satisfy memory read from storage subsystem rule) before c does, but in fact in some implementations f can bind out-of-order, as shown: the write d can be forwarded directly to e within the thread, before the write is committed to the storage subsystem. The satisfy memory read by forwarding an in-flight write rule models this. Replacing the control dependency with a data dependency (test PPOAA, not shown) removes that possibility, forbidding the given result on current hardware, as far as our experimental results show, and in our model. The current architecture text [Pow09] leaves the PPOAA outcome unspecified, but we anticipate that future versions will explicitly forbid it.
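The first precondition of the forwarding rule can be sketched in Python as a backwards scan over one thread's instructions in program order. This is a minimal illustration under stated assumptions: the `Instr` record and `forwardable_write` function are hypothetical names, and the sync and isync preconditions of the rule are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Instr:
    kind: str              # "W" (write), "R" (read) or "other"
    addr: Optional[str]    # None means the address is not yet determined
    committed: bool = False

def forwardable_write(thread: List[Instr], read_idx: int) -> Optional[int]:
    """Index of the in-flight write that may be forwarded to thread[read_idx],
    or None. Only the same-address / no-intervening-write condition is checked."""
    read = thread[read_idx]
    if read.kind != "R" or read.addr is None:
        return None
    # Walk backwards in program order from the read.
    for idx in range(read_idx - 1, -1, -1):
        ins = thread[idx]
        if ins.kind != "W":
            continue
        if ins.addr is None:
            return None    # an intervening write that might be to the same address
        if ins.addr == read.addr:
            # Nearest same-address write: forwardable only while uncommitted.
            return None if ins.committed else idx
    return None

# Thread 1 of PPOCA: c: R[y]; conditional branch; d: W[x]; e: R[x]; f: R[z]
ppoca_thread1 = [Instr("R", "y"), Instr("other", None),
                 Instr("W", "x"), Instr("R", "x"), Instr("R", "z")]
print(forwardable_write(ppoca_thread1, 3))  # 2: d can be forwarded to e
```

Note that nothing in this check refers to the uncommitted branch between c and d, which is why the control dependency alone does not stop e (and hence f) from being satisfied early.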
Test PPOCA: Allowed
  Thread 0: a: W[z]=1 -sync-> b: W[y]=1
  Thread 1: c: R[y]=1 -ctrl-> d: W[x]=1; e: R[x]=1 -addr-> f: R[z]=0
  Reads-from: b to c; d to e; f reads z from the initial state.

Test RSW: Allowed
  Thread 0: a: W[z]=1 -sync-> b: W[y]=2
  Thread 1: c: R[y]=2 -addr-> d: R[x]=0 -po-> e: R[x]=0 -addr-> f: R[z]=0
  Reads-from: b to c; d, e and f read from the initial state.

Aggressively out-of-order reads. In the reads-from-same-writes (RSW) variant of MP above, the two reads of x, d and e, happen to read from the same write (the initial state). In this case, despite the fact that d and e are reading from the same address, the e/f pair can satisfy their reads out-of-order, before the c/d pair, permitting the outcome shown. The address of e is known, so it can be satisfied early, while the address of d is not known until its address dependency on c is resolved. In contrast, in an execution in which d and e read from different writes to x (test RDW, not shown, with another write to x by another thread), that is forbidden: in the model, the commit of the first read (d) would force a restart of the second (e), together with its dependencies (including f), if e had initially read from a different write to d. In actual implementations the restart might be earlier, when an invalidate is processed, but will have the same observable effect.

Coherence and lwsync: blw-w-006. This example shows that one cannot assume that the transitive closure of lwsync and coherence edges guarantees ordering of write pairs, which is a challenge for over-simplified models. In our abstract machine, the fact that the storage subsystem commits to b being before c in the coherence order has no effect on the order in which writes a and d propagate to Thread 2. Thread 1 does not read from either Thread 0 write, so they need not be sent to Thread 1, so no cumulativity is in play.
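To make the over-simplified reasoning concrete, the following Python sketch computes the transitive closure of the lwsync and coherence edges of blw-w-006 (a to b by lwsync on Thread 0, b to c by coherence, c to d by lwsync on Thread 1). The `transitive_closure` helper and the variable names are illustrative assumptions. The closure relates a to d, yet the abstract machine places no constraint on the order in which a and d propagate to Thread 2, so treating this closure as an ordering guarantee would wrongly forbid the test.

```python
def transitive_closure(edges):
    """Smallest transitive relation containing `edges` (a set of pairs)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

# lwsync edge on Thread 0, coherence edge b before c, lwsync edge on Thread 1.
BLW_W_006_EDGES = {("a", "b"), ("b", "c"), ("c", "d")}

naive_order = transitive_closure(BLW_W_006_EDGES)
print(("a", "d") in naive_order)  # True: the naive closure orders a before d
```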
Test blw-w-006: Allowed
  Thread 0: a: W[x]=1 -lwsync-> b: W[y]=1
  Thread 1: c: W[y]=2 -lwsync-> d: W[z]=1
  Thread 2: e: R[z]=1 -addr-> f: R[x]=0
  Coherence: b before c. Reads-from: d to e; f reads x from the initial state.

In some implementations, and in the model, replacing both lwsyncs by syncs (bsync-w-006) forbids this behaviour. In the model, it would require a cycle in abstract-machine execution time, from the point at which a propagates to its last thread, to the Thread 0 sync ack, to the b write accept, to c propagating to Thread 0, to c propagating to its last thread, to the Thread 1 sync ack, to the d write accept, to d propagating to Thread 2, to e being satisfied, to f being satisfied, to a propagating to Thread 2, to a propagating to its last thread. The current architecture text again leaves this unspecified, but one would expect that adding sync everywhere (or, in this case, an address dependency between two reads) should regain SC.

Coherence and lwsync: 2+2W and R01. The 2+2W+lwsyncs example below is a case where the interaction of coherence and lwsyncs does forbid some behaviour. Without the lwsyncs (2+2W), the given execution is allowed. With them, the writes must be committed in program order, but after one partial coherence commitment (say d before a) is done, the other (b before c) is no longer permitted. (As this test has only writes, it may be helpful to note that the coherence order edges here could be observed either by reading the final state or with additional threads reading x twice and y twice. Testing both versions gives the same result.) This example is a challenge for axiomatic models with a view order per thread, as something is needed to break the symmetry. The given behaviour is also forbidden for the version with syncs (2+2W+syncs).
Test 2+2W+lwsyncs: Forbidden
  Thread 0: a: W[x]=1 -lwsync-> b: W[y]=2
  Thread 1: c: W[y]=1 -lwsync-> d: W[x]=2
  Coherence: d before a; b before c.

Test R01: Allowed
  Thread 0: a: W[x]=1 -lwsync-> b: W[y]=1
  Thread 1: c: W[y]=2 -sync-> d: R[x]=0
  Coherence: b before c. Reads-from: d reads x from the initial state.

The R01 test above is a related case where we have not observed the given allowed behaviour in practice, but it is not currently forbidden by the architecture, and our model permits it. In the model, the writes can all reach the storage subsystem, the b/c partial coherence commitment be made, c be propagated to Thread 0, the sync be committed and acknowledged, and d be satisfied, all before a and the lwsync propagate to Thread 1.

LB and (no) thin-air reads. This LB dual of the SB example is another case where we have not observed the given allowed behaviour in practice, but it is clearly architecturally intended, so programmers should assume that future processors might permit it, and our model does. Adding data or address dependencies (e.g. in LB+datas) should forbid the given behaviour (the data dependency case could involve out-of-thin-air reads), but experimental testing here is vacuous, as LB itself is not observed.

Test LB: Allowed
  Thread 0: a: R[x]=1 -po-> b: W[y]=1
  Thread 1: c: R[y]=1 -po-> d: W[x]=1
  Reads-from: d to a; b to c.

Register shadowing. Adir et al. [AAS03] give another variant of LB (which we call LB+rs, for Register Shadowing), with a dependency on Thread 1 but re-using the same register on Thread 0, to demonstrate the observability of shadow registers. That is also allowed in our model but not observable in our tests, which is unsurprising given that we have not observed LB itself. However, the following variant of MP does exhibit observable register shadowing: the two uses of r3 on Thread 1 do not prevent the second read being satisfied out-of-order, if the reads are into shadow registers. The reuse of a register is not represented in our diagrams, so for this example we have to give the underlying PowerPC assembly code.
Test MP+sync+rs: Allowed
  Thread 0: a: W[x]=1 -sync-> b: W[y]=1
  Thread 1: c: R[y]=1 -po-> d: R[x]=0
  Reads-from: b to c; d reads x from the initial state.

  Thread 0          Thread 1
  li r1,1           lwz r3,0(r2)
  stw r1,0(r2)      mr r1,r3
  sync              lwz r3,0(r4)
  li r3,1
  stw r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Allowed: 1:r1=1 ∧ 1:r3=0

7. Experiments on hardware

The use of small litmus-test programs for discussing the behaviour of relaxed memory models is well-established, but most previous work (with notable exceptions) does not empirically investigate how tests behave on actual hardware. We use our tool [AMSS11] to run tests on machines with various Power processors: Power G5 (aka PowerPC 970MP, based on a POWER4 core), POWER5, POWER6, and POWER7. The tool takes tests in PowerPC assembly and runs them in a test harness designed to stress the processor, to increase the likelihood of interesting results. This is black-box testing, and one cannot make any definite conclusions from the absence of some observation, but our experience is that the tool is rather discriminating, identifying many issues with previous models (and [AMSS10] report the discovery of a processor erratum using it).

Our work is also unusual in the range and number of tests used. For this paper we have established a library based on tests from the literature [Pow09, BA08, AAS03, ARM08], new hand-written tests (e.g. the PPOCA, PPOAA, RSW, RDW, and 2+2W in §6, and many others), and systematic variations of several tests (SB, MP, WRC, IRIW, ISA2, LB, and two others, RWC and WWC) with all possible combinations of barriers or dependencies; we call this the "VAR3" family, of 314 tests. We ran all of these on Power G5, 6, and 7. In addition, we use the tool [AMSS10] to systematically generate several thousand interesting tests with cycles of edges (dependencies, reads-from, coherence, etc.) of increasing size, and tested some of these. As an important style point, we use tests with constraints on the final values (and hence on the values read) rather than loops, to make them easily testable. We give an excerpt of our experimental results below, to give the flavour; more are available online [SSA+11]. For example, PPOCA was observable on POWER G5 (1.0k/3.1G), not observable on POWER6, and then observable again on POWER7, consistent with the less aggressively out-of-order microarchitecture of POWER6.
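Result cells such as 1.0k/3.1G can be turned into numbers for comparison across processors. The Python sketch below assumes that a cell reads as "times the outcome was observed / total runs" and that k, M and G are decimal multipliers; both are assumptions made for illustration, and `parse_count` and `observation_rate` are hypothetical helper names.

```python
# Assumed decimal multipliers for the suffixes used in the result cells.
MULTIPLIER = {"k": 10**3, "M": 10**6, "G": 10**9}

def parse_count(text: str) -> float:
    """Turn a count such as '970k', '3.1G' or '0' into a number."""
    text = text.strip()
    if text and text[-1] in MULTIPLIER:
        return float(text[:-1]) * MULTIPLIER[text[-1]]
    return float(text)

def observation_rate(cell: str) -> float:
    """Fraction of runs in which the outcome was seen, for a cell like '62k/141G'."""
    seen, runs = cell.split("/")
    return parse_count(seen) / parse_count(runs)

# PPOCA on POWER G5, as quoted in the text above.
print(observation_rate("1.0k/3.1G") > 0)  # True: the outcome was observed
```

A rate of zero only says that the outcome was not seen in that many runs; as noted above, black-box testing cannot establish that it is impossible.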
  Test              Model    POWER6          POWER7
  WRC               Allow    ok 970k/12G     ok 23M/93G
  WRC+data+addr     Allow    ok 562k/12G     ok 94k/93G
  WRC+syncs         Forbid   ok 0/16G        ok 0/110G
  WRC+sync+addr     Forbid   ok 0/16G        ok 0/110G
  WRC+lwsync+addr   Forbid   ok 0/16G        ok 0/110G
  WRC+data+sync     Allow    ok 150k/12G     ok 56k/94G
  PPOCA             Allow    unseen 0/39G    ok 62k/141G
  PPOAA             Forbid   ok 0/39G        ok 0/157G
  LB                Allow    unseen 0/31G    unseen 0/176G

The interplay between manual testing, systematic testing, and discussion with IBM staff has been essential to the development of our model. For example: the PPOCA/PPOAA behaviour was discovered in manual testing, leading us to conjecture that it should be explained by write-forwarding, which was later confirmed by discussion; the blw-w-006 test, found in systematic testing, highlighted difficulties with coherence and lwsync in an earlier model; and the role of coherence and sync acknowledgements in the current implementations arose from discussion.

8. Executing the model

The intricacy of relaxed memory models (and the number of tests we consider) makes it essential also to have tool support for exploring the model, to automatically calculate the outcomes that the model permits for a litmus test, and to compare them against those observed in practice. To ease model development, and also to retain confidence in the tool, its kernel should be automatically generated from the model definition, not hand-coded. Our abstract machine is defined in Lem, a new lightweight language for machine-formalised mathematical definitions, of types, functions and inductive relations [OBZNS]. From this we generate HOL4 prover code (and thence an automatically typeset version of the machine) and executable OCaml code, using a finite set library, for the precondition and action of each transition rule. We also formalised a symbolic operational semantics for the tiny fragment of the instruction set needed for our tests.

Using those, we build an exhaustive memoised search procedure that finds all possible abstract-machine executions for litmus tests. This has confirmed that the model has the expected behaviour for the 41 tests we mention by name in this paper, for the rest of the VAR3 family of 314 systematic tests,
and for various other tests. In most cases the model exactly matches the Power 7 experimental results, with the exception of a few where it includes the experimental outcomes but is intentionally looser; this applies to 60 tests out of our batch of 333. Specifically: the model allows instructions to commit out of program order, which permits the LB and LB+rs test outcomes (not observed in practice); the model also allows an isync to commit even in the presence of previously uncommitted memory accesses, whereas the specified outcomes of tests such as WRC with an lwsync and isync have not been observed; and the R01 test outcome is not observed. In all these cases the model follows the architectural intent, as confirmed with IBM staff.

Our experimental results also confirm that Power G5 and 6 are strictly stronger than Power 7 (though in different ways): we have not seen any test outcome on those which is not also observable on Power 7. The model is thus also sound for those, to the best of our knowledge.

9. Related Work

There has been extensive previous work on relaxed memory models. We focus on models for the major current processor families that do not have sequentially consistent behaviour: Sparc, x86, Itanium, ARM, and Power. Early work by Collier [Col92] developed models based on empirical testing for the multiprocessors of the day. For Sparc, the vendor documentation has a clear Total Store Ordering (TSO) model [SFC91, Spa92]. It also introduces PSO and RMO models, but these are not used in practice. For x86, the vendor intentions were until recently quite unclear, as was the behaviour of processor implementations. The work by Sarkar, Owens, et al. [SSZN+09, OSS09, SSO+10] suggests that for normal user- or system-code they are also TSO. Their work is in a similar spirit to our own, with a mechanised semantics that is tested against empirical observation. Itanium provides a much weaker model than TSO, but one which is more precisely defined by the vendor than x86 [Int02]; it has also been formalised in TLA [JLM+03] and in higher-order logic [YGLS03].

For Power, there have been several previous models, but none are satisfactory for reasoning about realistic concurrent code. In part this is because the architecture has changed over time: the lwsync barrier has been added, and barriers are now cumulative. Corella, Stone and Barton [CSB93] gave an early axiomatic
model for PowerPC, but, as Adir et al. note [AAS03], this model is flawed (it permits the non-SC final state of the MP+syncs example we show in §2). Stone and Fitzgerald later gave a prose description of PowerPC memory order, largely in terms of the microarchitecture of the time [SF95]. Gharachorloo [Gha95] gives a variety of models for different architectures in a general framework, but the model for the PowerPC is described as "approximate"; it is apparently based on Corella et al. [CSB93] and on May et al. [MSSW94]. Adve and Gharachorloo [AG96] make clear that PowerPC is very relaxed, but do not discuss the intricacies of dependency-induced ordering, or the more modern barriers. Adir, Attiya, and Shurek give a detailed axiomatic model [AAS03], in terms of a view order for each thread. The model was "developed through an iterative process of successive refinements, numerous discussions with the PowerPC architects, and analysis of examples and counterexamples", and its consequences for a number of litmus tests (some of which we use here) are described in detail. These facts inspire some confidence, but it is not easy to understand the force of the axioms, and it describes non-cumulative barriers, following the pre-PPC 1.09 PowerPC architecture; current processors appear to be quite different. More recently, Chong and Ishtiaq give a preliminary model for ARM [CI08], which has a very similar architected memory model to Power. In our initial work in this area [AFI+09], we gave an axiomatic model based on a reading of the Power ISA 2.05 and ARM ARM specifications, with experimental results for a few tests (described as work in progress); this seems to be correct for some aspects but to give an unusably weak semantics to barriers.

More recently, we gave a rather different axiomatic model [AMSS10], further developed in Alglave's thesis [Alg10] as an instance of a general framework; it models the non-multi-copy-atomic nature of Power (with examples such as IRIW+addrs correctly allowed) in a simple global-time setting. The axiomatic model is sound with respect to our experimental tests, and on that basis can be used for reasoning, but it is weaker than the observed behaviour or architectural intent for some important examples. Moreover, it was based principally on black-box testing and its relationship to the actual processor implementations is less clear than that for the operational model we present here, which is more firmly grounded on microarchitectural and architectural discussion. In more detail, the axiomatic model is weaker than one might want for lwsync and for cumulativity: it allows MP+lwsync+addr and ISA2+sync+data+addr, which are not observed and which are intended to be architecturally forbidden. It also forbids R01, which is not observed but architecturally intended to be allowed, and which is allowed by the model given here. The two models are thus incomparable.

We mention also Lea's JSR-133 Cookbook for Compiler Writers [Lea], which gives informal (and approximate) models for several multiprocessors, and which highlights the need for clear models.

10. Conclusion

To summarise our contribution, we have characterised the actual behaviour of Power multiprocessors, by example and by giving a semantic model. Our examples include new tests illustrating several previously undescribed phenomena, together with variations of classic tests and a large suite of automatically generated tests; we have experimentally investigated their behaviour on a range of processors. Our model is: rigorous (in machine-typechecked mathematics); experimentally validated; accessible (in an abstract machine style, and detailed here in a few pages of prose); usable (as witnessed by the explanations of examples); supported by a tool, for calculating the possible outcomes of tests; and sufficient to explain the subtle behaviour exposed by our examples and testing. It is a new abstraction, maintaining coherence and cumulativity properties by fiat but modelling out-of-order and speculative execution explicitly.

The model should provide a good intuition for developers of concurrent systems code for Power multiprocessors, e.g. of concurrency libraries, language runtimes, OS kernels, and optimising compilers. Moreover, as the ARM architecture memory model is very similar, it may well be applicable (with minor adaptation) to ARM.

The model also opens up many directions for future research in verification theory and tools. For example, it is now possible to state results about the correct compilation of the C++0x concurrency primitives to Power processors, and to consider barrier- and dependency-aware optimisations in that context. We have focussed here primarily on
the actual behaviour of implementations, but there is also work required to identify the guarantees that programmers actually rely on, which may be somewhat weaker: some of our more exotic examples are not natural use-cases, to the best of our knowledge. There is also future work required to broaden the scope of the model, which here covers only cacheable memory without mixed-size accesses.

We described our model principally in its own terms and in terms of the observed behaviour, without going into details of the relationship between the model and the underlying microarchitecture, or with the vendor architecture specification [Pow09]; this remains future work. A central notion of the memory model text in the latter is that of when a memory read or write by one thread is performed with respect to another, which has a hypothetical (or subjunctive) definition, e.g. for loads: "A load by a processor P1 is performed with respect to a processor P2 when the value to be returned can no longer be changed by a store by P2", where that P2 store may not even be present in the program under consideration (again, ARM is similar). This definition made perfect sense in the original white-box setting [DSB86], where the internal structure of the system was known and one can imagine the hypothetical store by P2 appearing at some internal interface, but in the black-box setting of a commercial multiprocessor, it is hard or impossible to make it precise, especially with examples such as PPOCA. Our abstract-machine model may provide a stepping stone towards improved architectural definitions, perhaps via new axiomatic characterisations.

Acknowledgements

We acknowledge funding from EPSRC grants EP/F036345, EP/H005633, and EP/H027351, from ANR project ParSec (ANR-06-SETIN-010), and from INRIA associated team MM.

References

[AAS03] A. Adir, H. Attiya, and G. Shurek. Information-flow models for shared memory with an application to the PowerPC architecture. IEEE Trans. Parallel Distrib. Syst., 14(5):502-515, 2003.
[AFI+09] J. Alglave, A. Fox, S. Ishtiaq, M. O. Myreen, S. Sarkar, P. Sewell, and F. Zappa Nardelli. The semantics of Power and ARM multiprocessor machine code. In Proc. DAMP 2009, January 2009.
[AG96] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66-76, 1996.
[Alg10] J. Alglave. A Shared Memory Poetics. PhD thesis, Université Paris 7 - Denis Diderot, November 2010.
[AMSS10] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in weak memory models. In Proc. CAV, 2010.
[AMSS11] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Litmus: Running tests against hardware. In Proc. TACAS, 2011.
[ARM08] ARM. ARM Barrier Litmus Tests and Cookbook, October 2008. PRD03-GENC-007826 2.0.
[BA08] H.-J. Boehm and S. Adve. Foundations of the C++ concurrency memory model. In Proc. PLDI, 2008.
[BOS+11] M. Batty, S. Owens, S. Sarkar, P. Sewell, and T. Weber. Mathematizing C++ concurrency. In Proc. POPL, 2011.
[CI08] N. Chong and S. Ishtiaq. Reasoning about the ARM weakly consistent memory model. In MSPC, 2008.
[Col92] W. W. Collier. Reasoning about parallel architectures. Prentice-Hall, Inc., 1992.
[CSB93] F. Corella, J. M. Stone, and C. M. Barton. A formal specification of the PowerPC shared memory architecture. Technical Report RC18638, IBM, 1993.
[DSB86] M. Dubois, C. Scheurich, and F. Briggs. Memory access buffering in multiprocessors. In ISCA, 1986.
[Gha95] K. Gharachorloo. Memory consistency models for shared-memory multiprocessors. WRL Research Report, 95(9), 1995.
[Int02] Intel. A formal specification of Intel Itanium processor family memory ordering, 2002. developer.intel.com/design/itanium/downloads/251429.htm.
[JLM+03] R. Joshi, L. Lamport, J. Matthews, S. Tasiran, M. Tuttle, and Y. Yu. Checking cache-coherence protocols with TLA+. Form. Methods Syst. Des., 22:125-131, March 2003.
[KSSF10] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's next-generation server processor. IEEE Micro, 30:7-15, March 2010.
[Lea] D. Lea. The JSR-133 cookbook for compiler writers. http://gee.cs.oswego.edu/dl/jmm/cookbook.html.
[LSF+07] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. Sauer, E. M. Schwarz, and M. T. Vaden. IBM POWER6 microarchitecture. IBM Journal of Research and Development, 51(6):639-662, 2007.
[MSSW94] C. May, E. Silha, R. Simpson, and H. Warren, editors. The PowerPC architecture: a specification for a new family of RISC processors. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994.
[OBZNS] S. Owens, P. Böhm, F. Zappa Nardelli, and P. Sewell. Lightweight tools for heavyweight semantics. Submitted for publication. http://www.cl.cam.ac.uk/~so294/lem/.
[OSS09] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In Proc. TPHOLs, pages 391-407, 2009.
[Pow09] Power ISA™ Version 2.06. IBM, 2009.
[SF95] J. M. Stone and R. P. Fitzgerald. Storage in the PowerPC. IEEE Micro, 15:50-58, April 1995.
[SFC91] P. S. Sindhu, J.-M. Frailong, and M. Cekleov. Formal specification of memory models. In Scalable Shared Memory Multiprocessors, pages 25-42. Kluwer, 1991.
[SKT+05] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4-5):505-522, 2005.
[Spa92] The SPARC Architecture Manual, V. 8. SPARC International, Inc., 1992. Revision SAV080SI9308. http://www.sparc.org/standards/V8.pdf.
[SSA+11] S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER multiprocessors. www.cl.cam.ac.uk/users/pes20/ppc-supplemental, 2011.
[SSO+10] P. Sewell, S. Sarkar, S. Owens, F. Zappa Nardelli, and M. O. Myreen. x86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. Communications of the ACM, 53(7):89-97, July 2010.
[SSZN+09] S. Sarkar, P. Sewell, F. Zappa Nardelli, S. Owens, T. Ridge, T. Braibant, M. Myreen, and J. Alglave. The semantics of x86-CC multiprocessor machine code. In Proc. POPL 2009, January 2009.
[YGLS03] Y. Yang, G. Gopalakrishnan, G. Lindstrom, and K. Slind. Analyzing the Intel Itanium memory ordering rules using logic programming and SAT. In Proc. CHARME, LNCS 2860, 2003.