Hierarchical Cache Coherence Protocol erication One Le el at ime Through Assume-Guarantee Xiaof ang Chen ang, Ganesh Gopalakrishnan Ching-Tsun Chou School of Computing, Uni ersity of Utah Intel Corporation Salt Lak City UT 84112 Santa Clara, CA 95054 Abstract Due to the err or -pr one natur of moder cache coher ence pr otocols, in all moder pr ocessor design o ws these pr otocols ar ormally specied at the le el of interlea ving atomic transactions and model check ed. Explicit state enumeration methods ar almost always used or coher ence pr otocol

eri- cation, as symbolic methods ha failed to deli er adv antages in this ar ea. The mo to wards multicor es implies that hierar chical or ganizations of se eral differ ent cache coher ence pr otocols will be employ ed in futur e. The pr oduct state space of all these pr otocols jointly operating in multicor cache hierar ch is bey ond the each of all ailable explicit state model check ers. In [1], an assume guarantee technique that allo wed these pr otocols to be handled or the rst time was eported. In this appr oach, method was pr oposed to cr eate set of initial abstract pr

otocols Abs #i wher each Abs #i simulates the gi en hierar chical pr otocol. Since the arious Abs #i depend on each other erication consists of dealing with the set Abs #i in an assume guarantee manner ening Abs #i in the pr ocess. The drawbacks of [1] wer e: (i) en one single Abs #i modeled mor than one cluster; in particular portions of other clusters and dir ectory structur es wer also modeled, thus still cr eating ery lar ge pr oduct spaces, (ii) details such as non-inclusi caching hierar chies could not be handled. This paper er comes both these limitations, handling

non-inclusive caching hierarchies, and bringing about a 95% reduction in the total state space encountered during any single explicit enumeration search, and requiring only a few such runs to finish verification.

Cache coherence protocols are central to all future advances in chip multiprocessors (CMPs), otherwise known as `multicores'. Modern multiple chip-multiprocessors (M-CMP) employ extremely complex cache coherence protocols, to ensure high performance and to accommodate manufacturing realities such as unordered coherence message networks. In order to eliminate concurrency bugs from

these protocols, modern industrial practice requires that finite state models of these protocols be exhaustively verified, not just debugged using semi-formal methods. This is achieved in most cases by modeling small instances of these protocols (e.g., three CPUs handling two addresses and one bit of data) in terms of interleaving atomic steps in guard/action languages such as Murphi [2] or TLA+ [3], and exploring the reachable states through explicit state enumeration. The implementation of the atomic steps is an important, but separate, problem, not addressed in this work (it may be

addressed through refinement verification [4] or synthesis [5]).

(This work is supported in part by SRC Contract 2005-TJ-1318. Main contact: xiachen@cs.utah.edu.)

Researchers have tried but failed to verify the interleaving models of cache coherence protocols using symbolic (BDD/SAT) methods. In [6], it is reported that a modest benchmark protocol with a 64-bit state vector and a few pages of description could not be verified, when the very same set of techniques could handle a CPU model with millions of state bits. The main reason for this failure is that it is very difficult to

project away the state bits from cache coherence protocol models. What this means in terms of modern practice is that for industrial protocols that occupy 50 or more pages of description with hundreds of state bits, explicit state enumeration methods must cope with verification complexity.

[Figure 1: three clusters (home cluster, remote cluster 1, remote cluster 2), each with two L1 caches, an L2 Cache + Local Dir, and a RAC, connected to the Global Dir and Main Mem.]
Fig. 1. A 2-level hierarchical cache coherence protocol.

Multicore chips will be organized in hierarchies (e.g.,

Figure 1). At present, it is common to see four quad-core chips forming a 16-way multiprocessor. These organizations will employ two levels of cache protocols called (i) the intra cluster level where the L1 caches of the CPUs will be kept coherent among themselves and with respect to a local L2 cache level, and (ii) the inter cluster level where the clusters themselves will be kept coherent. This is bad news, considering that the intra cluster protocol state space itself will be very large, and that any product of the states of three intra cluster protocols and one inter cluster protocol (Figure 1)

would be unacceptably large. In our previous work [1], we presented a compositional approach for partly mitigating this problem. The workflow of the approach is shown in Figure 2. Given an M-CMP protocol, we first built a set of abstract protocols Abs #i. Each Abs #i overapproximates, by construction, the original protocol (see Figure 3 which illustrates one such abstract protocol). However, it is also the case that each Abs #i employs assumptions that will be validated only by verifying some number of Abs #j that,
then, justifies these assumptions. In turn, Abs #j

employs assumptions that are justified by verifying some number of Abs #i (further details of such meta-circular dependencies are discussed in Section II-A). In [1], we show that (i) each Abs #i has far fewer states than the original hierarchical protocol, (ii) the additive complexity of verifying the Abs #i in turn is also far less than the complexity of the original protocol. However, as will be seen from Figure 3, even one Abs #i involves the product state space of one entire unit (such as `Home cluster' in Figure 3) and two simplified units (such as `Remote clusters' in Figure 3).
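To make the cost of such a product concrete, the following minimal Python sketch computes an upper bound on the number of joint states when one detailed unit is composed with several simplified units. The function name and the example figures are illustrative assumptions, not numbers from the experiments, and the reachable state count of a real model may be smaller than this bound.

```python
def product_state_bound(detailed_states: int,
                        simplified_states: int,
                        simplified_units: int) -> int:
    """Upper bound on joint states: |detailed| * |simplified| ** units.

    Hypothetical helper for illustration; it assumes the units are
    composed as a plain Cartesian product of their local states.
    """
    if detailed_states < 1 or simplified_states < 1 or simplified_units < 0:
        raise ValueError("state counts must be positive, unit count non-negative")
    return detailed_states * simplified_states ** simplified_units


# Assumed example: a detailed cluster with 1,000,000 states composed with
# two simplified remote units that have only 2 states each.
bound = product_state_bound(1_000_000, 2, 2)
print(bound)  # 4000000, i.e. four times the detailed cluster alone
```

Even very small simplified units multiply the cost of the detailed unit, which is why removing them from the product entirely is attractive.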

[Figure 2: workflow. The original hierarchical protocol is abstracted into Abs #i and model checked. On a coherence violation, the counterexample is analyzed to find the overly approximated rule. A genuine bug is fixed in the original protocol; a spurious bug leads to strengthening the guard and adding a new verification obligation to some Abs #j.]
Fig. 2. The workflow of our approach.

This paper addresses the following questions:
• Can we contain the verification complexity so that we never incur more than the state space of one entire cluster? In other words, can we avoid incurring the product state space complexity however small the “remote unit” state spaces are? This is because even if the remote units

have just two states each, we would be dealing with four times the number of states in an entire unit, which is very high. In this paper we show that this is possible by modeling the interfaces of the protocols in a particular way (Section II-B). The result is that we are able to reduce the number of states examined in any one abstract protocol to under 5% of the number of states examined in [1].
• What other complications arise in realistic caching hierarchies and how do we handle them? Simpler M-CMP protocols use the inclusive caching hierarchy. That is, the content of an L1 cache is a subset of that of the

L2 cache in the cluster. Protocols which use non-inclusive caching hierarchies are far more complex. We show that by using history variables but in a novel way (Section III-C), we can extend the approach of [1].
• How can “standard tricks” be safely applied? One standard trick in most cache coherence exercises is to maintain the latest copy of a cache line in a book-keeping variable. In Section II-B, we show the perils of blindly extending this trick to now maintaining an intra copy and an inter copy, and offer solutions.
• In Section II-B and III-B, we show with examples why the verification of a hierarchical

protocol is simply not a matter of verifying each level of the hierarchy separately; we show how imprecise records arise, and how we deal with them.

Related work: A Term Rewriting method for reasoning about the correctness of hierarchical cache coherence protocols was presented in [7]. This work has not been demonstrated on protocols of realistic sizes as well as realistic features such as the inclusion relationship between caches, unordered networks, etc. As explained in [1], our protocols were designed by consulting with an industrial expert who helped us include these features, and also additional

features such as the silent dropping of cache lines which introduces additional corner cases. Thus, even in terms of the complexity of the protocols within one cluster, our protocols far exceed the complexity of popular benchmarks such as FLASH [8] and German [9]. We have already compared our work in this paper against our past work in [1]. Our paper in [1] and this paper derive their basic ideas from Chou et al.'s work [10], which was a method for parameterized verification for non-hierarchical cache coherence protocols, and McMillan's work on compositional model checking [11]. While the use of history

variables in program verification goes back several decades (e.g. [12], [13]), our usage of history variables is in the context of assume-guarantee reasoning, and it involves a meta-circular approach as discussed in Section III-C.

A. Background

In the M-CMP system verified in [1] (Figure 1), every cluster contains two processors of identical design, each with a private L1 cache. The L2 cache is shared by the two processors, and its content is a superset of that of the L1 caches, i.e. the inclusive caching organization is used. The remote access controller (RAC) is used by a cluster to

communicate with other clusters. We model one address in our model, and associate the home cluster with the main memory for this address. In reality the main memory is attached to every cluster; the fact that there is only one memory is a consequence of the one-address abstraction of our protocol. The two remote clusters are of identical design. The protocol used inside a cluster is a directory-based MESI protocol [14], [15], maintaining the coherence for the caches within a cluster. The local directory records which L1 cache(s) or whether the L2 cache has a valid copy, and in which specific state. The protocol used

among clusters is also a directory-based MESI protocol. The global directory only has the high-level information of which cluster holds a valid copy in which specific state. Other features of this protocol include: (i) silent-drops are supported on non-Modified cache lines, and (ii) network channels are modeled in non-FIFOed ordering, e.g. messages can arrive out of order. In [1], we presented an approach for decomposing a hierarchical protocol into a set of abstract protocols, using abstraction and counter-example guided refinement in an assume-guarantee manner. For the protocol shown

as in Figure 1, we decompose the protocol into three abstract protocols. In every abstract protocol, one detailed cluster is maintained, and the remaining two are abstracted. Because the two remote clusters are of identical design, there are altogether two distinct abstract protocols. Figure 3 shows an abstract protocol where the details of the home cluster are maintained (contrast with Figure 1). “Local Dir'” represents part of the local directory “Local Dir”. In the approach of [1], the abstract protocols are obtained by overapproximating the original protocol by projecting away
intra-cluster details of different clusters.

[Figure 3: the home cluster is kept in detail (two L1 caches, L2 Cache + Local Dir, RAC); remote clusters 1 and 2 are reduced to L2 Cache + Local Dir and RAC; Global Dir and Main Mem are kept.]
Fig. 3. One of the two abstract protocols in the previous approach.

The following figure shows an example of a transition before and after abstraction.

[Figure 4: a cluster with two L1 caches, L2 Cache + Local Dir, and RAC receiving a WB message, next to the abstracted cluster with only L2 Cache + Local Dir and RAC.]
    Before abstraction:
        Clusters[c].WbMsg.Cmd = WB
        ==>
        Clusters[c].L2.Data := Clusters[c].WbMsg.Data;
        Clusters[c].L2.HeadPtr := L2;
        ...
    After abstraction:
        True
        ==>
        Clusters[c].L2.Data := nondet;
        ...
Fig. 4. A transition before and after abstraction.

In Figure 4, a writeback request from an L1 cache is received. With this request, the local directory updates the L2 cache copy, and records that the L2 cache has the latest copy in the cluster. After this cluster is abstracted, the L1 caches and the writeback network channel are removed. So every transition which involves the details of these components is overapproximated. In this example, the guard is overapproximated to true, i.e. it becomes more permissive. The cache copy in the writeback request is replaced with a nondeterministic copy “nondet”, and the second

statement in action is removed. When we model check an abstract protocol, violations to coherence properties can be found, due to genuine bugs in the original protocol, or overapproximation of the abstraction. For genuine bugs, we fix them in the original protocol, and re-generate the abstract protocols. For spurious bugs, we do the refinement. For the example in Figure 4, it is clear that the abstracted transition can easily introduce coherence violations, because the L2 cache can update its copy arbitrarily. For this example, we can strengthen the guard with the formula that (1) the L2 cache

is in exclusive or modified state. Also, we add a new verification obligation to one of the abstracted protocols, ensuring that whenever there is a writeback request in a cluster, (1) must hold. This is a characteristic of the inclusive caching organization, which does not necessarily hold for non-inclusive ones, as will be shown in Section III. The question of which abstract protocol the verification obligation is added to depends on which one maintains the details for this cluster. Since each abstract protocol maintains the details of a different part of the original protocol, the abstract protocols depend on each other to justify the refinement. (Footnote: The visual analogy of how large mirror telescopes are made of mirror segments may help.)

The problem with the above decomposition approach is that every abstract protocol still contains one intra- and one inter-cluster protocol. So its state space is still very large. For example, we show that (see Section II-E) one abstract protocol has a state space of more than half a billion states, with 40-bit hash compaction. This somewhat reaches the upper limit of the number of states currently explored by an explicit state model checker in a day or two on a

powerful machine. Also, there are duplicated behaviors among different abstract protocols. The proposed method slashes the states down to 5% of the original, and the runtimes come down from a day to under 10 minutes.

B. New Decomposition Approach

As the title of the paper suggests, our new approach involves abstract protocols that involve either the intra- or the inter-cluster protocol, but never both. The techniques used to build and verify the abstract protocols are similar to before: (i) every abstract protocol is obtained by overapproximating the original protocol, through projecting away different

components; (ii) in the counter-example guided refinement for abstract protocols, the formula used to strengthen the guard of a transition will be proved as an invariant with assume-guarantee reasoning; (iii) we use simulation to prove the soundness of our approach. The difference with our previous approach lies in modeling the interfaces between the intra- and the inter-cluster protocols. In [1], the interfaces were modeled in a “tightly-coupled” manner. More precisely, some Murphi transitions involve the details of both the intra- and the inter-cluster protocols. For instance, in one

transition, the conditions under which the transition is enabled are that a local directory receives a request from an L1 cache asking for a cache copy, the L2 cache does not have the copy, and the local directory is not busy when the request is received. Once the transition is enabled, the local directory is set to busy, the RAC of the cluster is also set to busy, and a request asking for a cache copy is placed on the network channel used among the clusters. If hierarchical protocols are modeled in this manner, it is not easy to see how to decompose the protocol into “per level” abstract protocols. For example, in the

transition we just described, after the cluster is abstracted, the guard becomes that the L2 cache does not have the copy. In the refinement process, we need to strengthen the guard using the fact that the RAC of the cluster must not be busy. To ensure that the guard strengthening is sound, we have to prove that whenever the local directory is not busy, the RAC of the cluster must not be busy. This requires one abstract protocol to maintain the details of a whole cluster including the L1 and L2 caches, the local directory and the RAC. Such abstractions prevent us from decomposing the protocol “per-level”. In our current approach, interfaces are modeled as “loosely-coupled”. In more detail, a transition is allowed to involve either the intra- or the inter-cluster details, including the interface between them. It is not allowed to involve the
details of both levels. Writing verification models in this manner is not difficult, and does not mask any of the corner cases that were there before. For the example above, the transition can be divided into two transitions in our new protocol: (i) the first sets the local directory to busy and puts

the request on the interface; (ii) when the request on the interface is detected, and the RAC of the cluster is free, the second transition forwards the request to outside the cluster, clears the interface, and sets the RAC to busy. Other than better interface characterization, in developing the new decomposition approach, we also improved the way coherence properties are represented for hierarchical protocols. In the M-CMP protocol presented in [1], the intra- and inter-CMP protocols each use an auxiliary variable to keep track of the latest copy inside itself. Let these variables be

intra copy and inter copy respectively. For each cluster, intra copy is initially undefined. It is updated when a cluster receives a reply which grants a valid copy, receives an invalidate request from outside the cluster, or an L1 cache inside the cluster updates its copy in exclusive or modified state. On the other hand, inter copy is initially set to the copy in the main memory. It is updated when an L2 cache receives a writeback or shared writeback request, or a reply which grants a valid copy from inside a cluster. So the values of intra copy and inter copy depend on each other. In developing the new decomposition approach, we discovered that these latest copies need to be constrained carefully on the interface. Figure 5 shows a simple scenario of a buggy protocol.

[Figure 5: message sequence among remote cluster-1 (L1-1, L2 + Local Dir), the Global Dir, and remote cluster-2 (L1-1, L2 + Local Dir): 1. GetX, 2. Fwd_GetX, 3. Fwd_GetX, 4. Fwd_GetX, 5. PutX with d2, 6. Fwd_PutX, 7. Fwd_PutX, 8. PutX. L1-1 of remote cluster-1 starts in (M, d1) and ends in (I); the requester ends in (E, d2).]
Fig. 5. A scenario of a non-coherent cache protocol.

In Figure 5, remote cluster-1 initially has a modified copy in L1 cache-1, and intra copy is just d1. In Step 1, an L1 cache of remote

cluster-2 generates a GetX request, and this request is forwarded to the local directory of remote cluster-2, then to the global directory, and finally to L1 cache-1 of remote cluster-1 in Step 4. In Step 5, L1 cache-1 grants the request, sets itself to invalid, but mistakenly replies with some other data d2. On receiving the reply in Step 6, inter copy is set to d2 and intra copy is undefined. Thus, the latest copy in the system is lost. For the example in Figure 5, if intra copy and inter copy are not constrained in an appropriate way, model checking cannot detect the bug. This also shows that when two protocols are coupled hierarchically, more corner cases can come up due to the interfaces between the protocols. Specifically for this scenario, in the beginning of Step 6, a constraint that inter copy is the same as intra copy of remote cluster-1 should be asserted. Or in summary, a unique latest copy should be used in the whole system. The value for this unique latest copy is initially set to that in the main memory, and it is updated only when an L1 cache updates its copy in exclusive or modified state.

C. Details of the New Approach

[Figure 6: Abs. protocol #1 (home cluster: two L1 caches, L2 Cache + Local Dir), Abs. protocol #2 (remote cluster 1: two L1 caches, L2 Cache + Local Dir), and Abs. protocol #3 (Global Dir, Main Mem, and the L2 Cache + Local Dir and RAC of the home cluster and remote clusters 1 and 2).]
Fig. 6. The three abstract protocols from the new decomposition approach.

We still use Figure 1 as our current M-CMP protocol. For this protocol, our new approach will decompose it into four abstract protocols. Because the two remote clusters are of identical design, the two abstract intra-cluster protocols for the remote clusters are the same. So Figure 6 only shows three

distinct abstract protocols. The first one only contains the intra-cluster protocol of the home cluster. The second one is similar except that it is for a remote cluster. The clouds in the figure represent the environment outside a cluster. The environment will nondeterministically generate requests or replies to the cluster. It can be automatically obtained from abstraction. Finally, the third one contains the inter-cluster protocol which is used among the clusters. In the following, we will illustrate how each abstract protocol is created.

1) Abstraction: As in [1], the derivation of

each abstract protocol from the original protocol involves the abstraction of the variables, and the abstraction of the transition relation and invariants. Abstraction for state variables is simply projection. For example, to create the first abstract protocol, the RAC of the home cluster is projected away, and the rest of the system other than the home cluster is also dropped. For the transition relation, given a transition in the original protocol, one or more corresponding transitions will be created. In more detail, consider a rule (guarded transition) in the form of guard → action in a hierarchical protocol. If a sub-expression in guard contains a variable that has been projected away, we replace the sub-expression with true. For statements of the
form lhs := rhs in action, (i) if the left-hand side has been dropped, then the assignment is dropped; (ii) if the right-hand side contains variables that have been dropped, then it will be replaced with a nondeterministic value over the type of the left-hand side. Other statements and invariants are processed similarly. The above process ensures that every abstract protocol is an overapproximation of part of the original system. For example, in Figure 6, the first one is an overapproximation of the intra-cluster protocol of the home cluster

and others are similar. This ensures that in our approach, the composition of all the abstract protocols covers every reachable state in the original protocol.

2) Counter-Example Guided Refinement: The refinement process is the same as before. For every spurious bug of an abstract protocol, we strengthen the guard of the overly approximated transition, and also add a new verification obligation to one of the abstracted protocols. For the abstracted protocols in Figure 6, the first two will add their verification obligations to the third one. On the other hand, the third one

will add some of its verification obligations to the first, and add the rest to the second. So abstract protocols depend on each other to justify the refinement. Now consider one example of the refinement process. Suppose we model a unique latest copy in the system, denoted as latest copy. This variable is updated only when an L1 cache updates its copy in exclusive or modified state. After abstraction, latest copy can be updated arbitrarily in the third abstract protocol, because the details involving L1 caches are projected away. Violations to data

coherence properties can be easily detected due to the overapproximation. To do the refinement, we strengthen the rule to be: only when the L2 cache of a cluster is exclusive or modified can latest copy be updated. In the meanwhile, we add a verification obligation to the first and second abstract protocols, ensuring that when an L1 cache updates its copy, the L2 cache of the cluster must be exclusive or modified. This is a characteristic of the inclusive caching organization.

D. Soundness of the Approach

From the above, we can see that the refinement process is “circular”: the

abstract intra-cluster protocol depends on the abstract inter-cluster protocol for the soundness of the guard strengthening, and vice versa. In fact, the soundness of the refinement can be justified by temporal induction based assume-guarantee reasoning. The intuitive reason is that all the verification obligations which are added are checked from the initial states of the abstract protocols, step by step, and every abstract protocol is an overapproximation of part of the original protocol. More details of the formal proof can be found in [1]. Finally, we need to prove that if all the

abstract protocols can be verified coherent, the hierarchical protocol must be coherent. This takes the same proof as in [10]. That is, given an M-CMP protocol and the set of abstract protocols which are obtained using our approach, there exists a simulation relation between the original protocol and each abstract protocol. Also, every reachable state of the original protocol is contained in the reachable states of the composition of all the abstract protocols. Moreover, for every coherence property in the original protocol, there is a corresponding coherence property in each abstract protocol. Every such abstract property is obtained by abstracting (see Section II-C.1) the original property in that abstract protocol, and the abstract properties together are strong enough to represent the original property. In our

M-CMP example, these all fall into two categories. The first is that some abstract property is exactly the same as the original property and all the others are just true. The second is that the original property is exactly the conjunction of all the abstract properties. So in every case, the conjunction of the abstract properties implies the original property.

E. Experimental Results

Figure 7 shows the experimental results of the M-CMP coherence protocol, using the traditional model checking, the previous approach, and the current approach. The first three experiments were performed on an Intel IA-64 machine, and the last three were performed on a PC with an Intel Pentium CPU of 3.0GHz. 40-bit hash compaction was used in all

the experiments.

Approach            Model         # of states      Model check time (sec)  Use mem (GB)  Model check passed
Classical approach  Full model    > 438,120,000    > 125,410               18            Non conclusive
Previous approach   Abs. model 1  284,088,425      44,978                  18            Yes
Previous approach   Abs. model 2  636,613,051      66,249                  18            Yes
Current approach    Abs. model 1  1,500,621        270                     1.8           Yes
Current approach    Abs. model 2  574,198          50                      1.8           Yes
Current approach    Abs. model 3  198,162          21                      1.8           Yes

Fig. 7. Verification complexity using different approaches.

Here, we employed the Murphi model checker for the experiment. In the above table, model checking on the full model using the classical approach failed after more

than 0.4 billion states, due to state explosion. The previous decomposition approach was able to verify the protocol. However, the state space of each abstract protocol is still large. Using the new approach, the M-CMP protocol can be easily verified, with less than 2 GB of memory. It can reduce more than 95 percent of the state space of the original protocol.

Till now, the M-CMP protocols we have discussed use the inclusive caching hierarchy. That is, the content of the L1 cache is a subset of that of the L2 cache on the same cluster. Other than inclusive, there are two more caching hierarchies:

exclusive and non-inclusive. Exclusive means that any block that is present in an L1 cache cannot be in the L2 cache on the cluster. Non-inclusive lies between inclusive and exclusive: overlaps with and without containment are allowed. For illustration, some processors of the Intel Pentium family use non-inclusive caches, and processors of AMD Athlon and Opteron use exclusive caches. For inclusive caches, upon a cache miss in an L1 cache that hits in the L2, the cache controller only needs to copy the data to the missing L1 cache. On the other hand, when a block is replaced in the L2 cache due to a conflict or capacity

miss, the same block must be evicted from all the L1 caches
of the cluster. For exclusive caches, the effective cache size of the system can be the sum of the L1 and L2 caches. As non-inclusive caches can cover both inclusive and exclusive caches, we will only focus on non-inclusive caches in the following.

A. Verification Problems Due to Non-inclusion

For non-inclusive protocols, two categories of problems make the verification hard. First, cache coherence properties of non-inclusive protocols may have to involve L1 caches from different clusters. For example, consider an often used

coherence property: no two caches can write to the same address concurrently. For inclusive caches, we can represent this property using two verification obligations: (i) no two clusters can have their L2 caches both be exclusive or modified, and (ii) in every cluster no two L1 caches can both be exclusive or modified. Each of these verification obligations can be model checked in abstract protocols using our new decomposition approach. In contrast, for non-inclusive caches, when two L1 caches from different clusters are both exclusive or modified, their corresponding L2 caches may not have the

copy. Since every one of our abstract protocols only maintains the details of at most one cluster, it is not straightforward how to represent the property that no two L1 caches from different clusters can be both exclusive or modified. Second, for error traces corresponding to spurious bugs in abstract protocols, it is not straightforward how to refine the overapproximation. This is because the L2 cache may not contain a valid copy of the cache line which is present in some L1 cache of the cluster. For example, consider the following scenario. Initially an L1 cache has an exclusive copy to write back, and

the L2 cache on the cluster does not contain this block. After the writeback request is received, the L2 cache transits to exclusive state with a valid copy. When this transition is abstracted in the inter-cluster protocol, an L2 cache copy can change from invalid to exclusive arbitrarily. Clearly this is a very coarse overapproximation and it can easily lead to coherence failures. The above are problems which exist especially in non-inclusive hierarchical coherence protocols. In the next section, we will describe an approach for solving these problems. Our approach uses a variant of history variables

[12], [13], the variation being that the value of the history variables is also determined in an assume-guarantee manner. For the experimental validation of this approach, we created a non-inclusive variant of the benchmark protocol. This new protocol is more complex than the inclusive protocol, and hence it cannot be model checked using traditional approaches.

B. Protocol Details

Our benchmark protocol [16] has the same configurations as the inclusive protocol presented in Section II-A. The intra- and inter-cluster protocols all use an invalidation-based directory MESI protocol [15]. For the non-inclusive M-CMP

protocol, we assume that when an L2 cache line is sw apped out, the local directory of the cluster is also sw apped out. This assumption may not be ery practical for real coherence protocols. Ho we er it mak es the erication problem harder and forces us to come up with ne erication techniques. or the netw ork channels used within cluster other than the ones used in the inclusi protocol, there also xists set of broadcasting channels. These channels are used when there is cache miss in the L2 and the local directory has no information of whether some L1 has alid cop or not. After

broadcast, if reply containing alid cop is recei ed, the reply will be forw arded to the requesting L1 cache. Otherwise, the request will be forw arded to the global directory The characteristics of non-inclusi of our protocol can also mak the local directory ha an imprecise record of cache line. The follo wing gure sho ws simple scenario of ho the imprecision can happen. Global Dir Remote cluster-1 L1-1 L2 + Local Dir (S, d) (S, d) (S, d) 1. Swap 3. Broadcast 2. GetS 4. Nack L1-2 (I) (I) 5. Fwd_GetS 6. Put, d 7. Put (S, d) (S, d) Fig. 8. Imprecise state record in the local directory
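The scenario of Fig. 8 can be replayed with a small sketch. This is a toy model for a single cache line, not the benchmark protocol itself: the class `Cluster`, its fields, and the method names are hypothetical, only the states I and S are modeled, and the global directory is assumed to always grant the forwarded request. The message names in the trace are taken from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy single-line model of one cluster (all names are illustrative)."""
    l1_state: dict = field(default_factory=dict)  # L1 name -> "I" or "S"
    l2_state: str = "I"
    local_dir: set = field(default_factory=set)   # L1s the local directory records as sharers
    trace: list = field(default_factory=list)     # messages seen, in order

    def swap_out_l2(self):
        # Assumption stated in the text: swapping out the L2 line
        # also drops the local directory record for that line.
        self.l2_state = "I"
        self.local_dir.clear()
        self.trace.append("Swap")

    def get_shared(self, requester):
        self.trace.append("GetS")
        if self.l2_state == "I" and not self.local_dir:
            # L2 miss and no directory information: broadcast inside the cluster.
            self.trace.append("Broadcast")
            # Shared copies do not supply data, so the broadcast is NACKed
            # and the request goes to the global directory (assumed to grant it).
            self.trace.append("Nack")
            self.trace.append("Fwd_GetS")
            self.trace.append("Put")
            self.l2_state = "S"
        self.l1_state[requester] = "S"
        self.local_dir.add(requester)

    def invalidate_recorded_sharers(self):
        # An invalidation guided only by the local directory record.
        for name in self.local_dir:
            self.l1_state[name] = "I"
        self.local_dir.clear()
        self.l2_state = "I"

    def unrecorded_sharers(self):
        # L1 caches that hold a shared copy the local directory does not know about.
        return sorted(n for n, s in self.l1_state.items()
                      if s == "S" and n not in self.local_dir)


def figure8_scenario():
    c = Cluster(l1_state={"L1-1": "S", "L1-2": "I"}, l2_state="S", local_dir={"L1-1"})
    c.swap_out_l2()        # L2 line and directory record are dropped
    c.get_shared("L1-2")   # miss, broadcast, NACK, forward, grant
    return c
```

After the scenario, the local directory records only L1-2 while L1-1 still holds a shared copy, so an invalidation that consults the directory alone would leave L1-1 untouched.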

In Figure 8, initially an L1 cache and the L2 cache in a remote cluster have a shared copy, and it is recorded in the local directory. In Step 1, the L2 cache line is swapped out, and the record in the local directory is also dropped. In Step 2, another L1 cache in the same cluster requests a shared copy. As the local directory has no record about this line, the request is broadcast inside the cluster in Step 3. The L1 caches NACK this request in Step 4, as it is not safe for a shared copy to supply its data because the broadcast request can be interleaved with an invalidate request coming from outside the cluster. In Step 5, the request is forwarded to the global directory, and it is granted in Step 6. At this time, the local directory has lost the information that an L1 cache already has a shared copy. Such imprecision can lead to coherence violations for certain cache coherence properties, because subsequent invalidations will miss the shared copies in some L1 caches.

C. Infer “Exclusive”

Now we will present how the verification problems due to non-inclusion can be solved. Given a cluster in which the L2 cache does not have a valid copy, we can infer if there is an exclusive or modified copy in the cluster in two ways. One is to infer from outside the cluster, i.e., the global directory and the network channels used among clusters. The other is to infer from inside the cluster, including the L1 caches and the network channels used within the cluster. These two approaches are similar. In the following, we will describe the second one in detail.
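As a rough illustration of the second, inside-the-cluster inference, the sketch below computes a boolean from the L1 states and the intra-cluster channels. The checks mirror the conditions this section gives for the auxiliary variable IE, but the data layout is an assumption: `Msg`, `ClusterView`, the message kind strings, and the function names are hypothetical and do not come from the benchmark model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Msg:
    kind: str            # hypothetical kinds: "Reply", "WriteBack", "ShWriteBack", "OwnershipTransfer"
    state: str = ""      # state carried by a reply, e.g. "E" or "S"
    valid: bool = False  # whether a reply carries a valid copy


@dataclass
class ClusterView:
    l1_states: List[str]                 # one of "I", "S", "E", "M" per L1 cache
    l2_state: str = "I"
    broadcast_chan: List[Msg] = field(default_factory=list)
    net_chan: List[Msg] = field(default_factory=list)


def implicit_exclusive(c: ClusterView) -> bool:
    """True if the cluster holds an exclusive or modified copy outside its L2."""
    if any(s in ("E", "M") for s in c.l1_states):
        return True
    if any(m.kind == "Reply" and m.valid for m in c.broadcast_chan):
        return True
    if any(m.kind == "Reply" and m.state == "E" for m in c.net_chan):
        return True
    if any(m.kind in ("WriteBack", "ShWriteBack") for m in c.net_chan):
        return True
    if any(m.kind == "OwnershipTransfer" for m in c.net_chan):
        return True
    return False


def l2_may_become_exclusive(c: ClusterView) -> bool:
    # Strengthened guard: an invalid L2 line may turn exclusive
    # only when the inferred flag is true for the cluster.
    return c.l2_state == "I" and implicit_exclusive(c)


def ie_property_holds(clusters: List[ClusterView]) -> bool:
    # No two inferred flags are true at once, and
    # no two L2 caches are exclusive or modified at once.
    ie_count = sum(implicit_exclusive(c) for c in clusters)
    l2_count = sum(c.l2_state in ("E", "M") for c in clusters)
    return ie_count <= 1 and l2_count <= 1
```

In a model checker these would be invariants and rule guards over the protocol state rather than Python functions; the sketch only shows what information each check reads.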
We still use the same figure as before to represent the non-inclusive M-CMP protocol, and use the new decomposition approach for the verification. The difference is that we will use the intra-cluster abstract protocols to infer whether there is an exclusive or modified copy in the cluster, and this inference will be used by the inter-cluster abstract protocol.

In more detail, we add an auxiliary variable of boolean type for each cluster in the hierarchical protocol. Let this variable be IE (implicit exclusive). For every cluster, IE will be defined in the abstract intra-cluster protocol. Initially IE is set to false. It is defined to be true if one of the following conditions holds:

• If an L1 cache has an exclusive or modified copy, or
• If the broadcast channel contains a valid reply, or
• If a network channel contains a reply with an exclusive copy, or
• If there is a writeback or shared writeback request, or
• If there is an exclusive ownership transfer request.

When the value of an IE is true, it means that the cluster must have an exclusive or modified copy somewhere in the cluster other than in the L2 cache.

Now with IEs, the first problem mentioned in Section III-A can be solved similarly as in inclusive protocols. For example, consider the coherence property that no two L1 caches in different clusters can write to the same address concurrently. Now this property can be represented as: no two IEs from different clusters can be both true, and no two L2 caches can be both exclusive or modified.

For the second problem, we can now constrain overly approximated transitions using IEs. For the example which can change an invalid L2 cache line to exclusive arbitrarily, it now can be strengthened as: the L2 cache can change from invalid to exclusive, if IE is true for the cluster. As usual, to ensure that the strengthening is sound, we add a verification obligation to one of the abstract intra-cluster protocols. In this case, the verification obligation requires that when a writeback request is on the way from an L1 cache to the L2 cache in a cluster, the IE must be true.

D. Experimental Results

For this non-inclusive hierarchical protocol, we have used the traditional and the current approaches for checking coherence. Figure 9 shows the experimental results using these approaches. As for the inclusive protocol, 40-bit hash compaction was used in all the experiments. The first experiment was performed on an Intel IA-64 machine, and the last three were performed on a PC with an Intel Pentium CPU of 3.0GHz. Again, the Murphi model checker was employed for the experiments. From the table, we can see that the current approach can reduce the verification complexity, in terms of state space and runtime, by more than 95 percent.

                          Classical approach   Current approach
                          Full model           Abs. model 1   Abs. model 2   Abs. model 3
# of states               > 473,260,000        4,070,484      2,424,719      2,424,719
Model check time (sec)    > 161,398            770            250            248
Use mem (GB)              18                   1.8            1.8            1.8
Model check passed        Non conclusive       Yes            Yes            Yes

Fig. 9. Verification complexity using different approaches.

Coherence is a particular challenge for M-CMP systems, as usually they are more complex and have more corner cases than non-hierarchical protocols. In our previous work, we presented a compositional approach for verifying an inclusive directory-based M-CMP protocol, using abstraction and assume-guarantee reasoning. However, the state space of abstract protocols is still very large. In this paper we present a new decomposition approach which models hierarchical protocols using a better interface characterization. Every abstract protocol resulting from this approach only contains one instance of coherence protocols, and the approach is still sound. Furthermore, we also explored how our approach can be extended to verify non-inclusive M-CMP coherence protocols, using a variant of history variables. The experimental results show that our new approach is very effective in reducing the verification complexity.

Our method presented earlier is implemented as a manual form of abstraction and assume-guarantee reasoning. Currently we are mechanizing the abstraction process. That is, given a hierarchical protocol and the variables to be projected away, we try to automatically create an abstract protocol. We plan to automate the spurious error trace recognition and counterexample guided refinement. It will be important to understand how to effectively combine our methods with automatic learning algorithms found in current model checkers.

[1] X. Chen, Y. Yang, G. Gopalakrishnan, and C. Chou, “Reducing verification complexity of a multicore coherence protocol using assume/guarantee,” in Formal Methods in Computer Aided Design, 2006.
[2] D. L. Dill, A. J. Drexler, A. J. Hu, and C. H. Yang, “Protocol verification as a hardware design aid,” in IEEE Intl. Conference on Computer Design: VLSI in Computers and Processors, 1992.
[3] L. Lamport, “Specifying concurrent systems with TLA+,” in Calculational System Design, 1999.
[4] X. Chen, S. German, and G. Gopalakrishnan, “Transaction based modeling and verification of hardware protocol implementations,” Under submission. Available upon request.
[5] Arvind, “Bluespec: A language for hardware design, simulation, synthesis and verification,” in MEMOCODE, 2003.
[6] K. McMillan and N. Amla, “Automatic abstraction without counterexamples,” in Technical Report of Cadence, 2003.
[7] X. Shen and Arvind, “Specification of memory models and design of provably correct cache coherence protocols,” MIT Tech. Rep., 1997.
[8] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, “The Stanford FLASH multiprocessor,” in Proceedings of the 21st Intl. Symposium on Computer Architecture, 1994, pp. 302–313.
[9] S. German, “Tutorial on verification of distributed cache memory protocols,” in Formal Methods in Computer Aided Design, 2004.
[10] C. Chou, P. K. Mannava, and S. Park, “A simple method for parameterized verification of cache coherence protocols,” in Formal Methods in Computer Aided Design, 2004.
[11] K. McMillan, “Verification of infinite state systems by compositional model checking,” in Correct Hardware Design and Verification Methods, 1999.
[12] E. M. Clarke, “Proving the correctness of coroutines without history variables,” in ACM Southeast Regional Conference, 1978.
[13] M. Clint, “Program proving: Coroutines,” in Acta Informatica, 1973.
[14] M. Papamarcos and J. Patel, “A low overhead coherence solution for multiprocessors with private cache memories,” in Proc. 11th Annual Int'l Symposium on Computer Architecture, 1984.
[15] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[16] Http://www .cs.utah.edu/ xiachen /hl dvt 07 submission.