Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John B. Carter
School of Computing, University of Utah
{legion, naveen, karthikr, rajeev, retrac}@cs.utah.edu

Abstract

Improvements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption.

On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect composed of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and evaluate a subset of these techniques to show their performance and power-efficiency potential. Most of the proposed techniques can be implemented with a minimum complexity overhead.

1. Introduction

One of the greatest bottlenecks to performance in future microprocessors is the high cost of on-chip communication through global
wires [19]. Power consumption has also emerged as a first order design metric and wires contribute up to 50% of total chip power in some processors [32]. Most major chip manufacturers have already announced plans [20, 23] for large-scale chip multi-processors (CMPs). Multi-threaded workloads that execute on such processors will experience high on-chip communication latencies and will dissipate significant power in interconnects. In the past, only VLSI and circuit designers were concerned with the layout of interconnects for a given architecture. However, with communication emerging as a larger power and performance constraint than computation, it may become necessary to understand and leverage the properties of the interconnect at a higher level. Exposing wire properties to architects enables them to find creative ways to exploit these properties. This paper presents a number of techniques by which coherence traffic within a CMP can be mapped intelligently to different wire implementations with minor increases in complexity. Such an approach can not only improve performance, but also reduce power dissipation.

(Footnote: This work was supported in part by NSF grant CCF-0430063 and by Silicon Graphics Inc.)

In a typical CMP, the L2 cache and lower levels of the memory hierarchy are shared by multiple cores [24, 41]. Sharing the L2 cache allows high cache utilization and avoids duplicating cache hardware resources. L1 caches are typically not shared, as such an organization entails high communication latencies for every load and store. There are two major mechanisms used to ensure coherence among L1s in a chip multiprocessor. The first option employs a bus connecting all of the L1s and a snoopy bus-based coherence protocol. In this design, every L1 cache miss results in a coherence message being broadcast on the global coherence bus, and other L1 caches are responsible for maintaining valid state for their blocks and responding to misses when necessary. The second approach employs a centralized directory in the L2 cache that tracks sharing information for all cache lines in the L2. In such a directory-based protocol, every L1 cache miss is sent to the L2 cache, where further actions are taken based on that block's directory state. Many studies [2, 10, 21, 26, 30] have characterized the high frequency of cache misses in parallel workloads and the high impact these misses have on
total execution time. On a cache miss, a variety of protocol actions are initiated, such as request messages, invalidation messages, intervention messages, data block writebacks, data block transfers, etc. Each of these messages involves on-chip communication with latencies that are projected to grow to tens of cycles in future billion transistor architectures [3].

Simple wire design strategies can greatly influence wire properties. For example, by tuning wire width and spacing, we can design wires with varying latency and bandwidth properties. Similarly, by tuning repeater size and spacing, we can design wires with varying latency and energy properties. To take advantage of VLSI techniques and better match the interconnect design to communication requirements, a heterogeneous interconnect can be employed, where every link consists of wires that are optimized for either latency, energy, or bandwidth. In this study, we explore optimizations that are enabled when such a heterogeneous interconnect is employed for coherence traffic. For example, when employing a directory-based protocol, on a cache write miss, the requesting processor may have to wait for data from the home node (a two hop transaction) and for acknowledgments from other sharers of the block (a three hop transaction). Since the acknowledgments are on the critical path and have low bandwidth needs, they can be mapped to wires optimized for delay, while the data block transfer is not on the critical path and can be mapped to wires that are optimized for low power.

The paper is organized as follows. We discuss related work in Section 2. Section 3 reviews techniques that enable different wire implementations and the design of a heterogeneous interconnect. Section 4 describes the proposed innovations that
map coherence messages to different on-chip wires. Section 5 quantitatively evaluates these ideas and we conclude in Section 6.

2. Related Work

To the best of our knowledge, only three other bodies of work have attempted to exploit different types of interconnects at the microarchitecture level. Beckmann and Wood [8, 9] propose speeding up access to large L2 caches by introducing transmission lines between the cache controller and individual banks. Nelson et al. [39] propose using optical interconnects to reduce inter-cluster latencies in a clustered architecture where clusters are widely-spaced in an effort to alleviate power density. A recent paper by Balasubramonian et al. [5] introduces the concept of a heterogeneous interconnect and applies it for register communication within a clustered architecture. A subset of load/store addresses are sent on low-latency wires to prefetch data out of the L1D cache, while non-critical register values are transmitted on low-power wires. A heterogeneous interconnect similar to the one in [5] has been applied to a different problem domain in this paper. The nature of cache coherence traffic and the optimizations it enables are very different from those of register traffic within a clustered microarchitecture. We have also improved upon the wire modeling methodology in [5] by modeling the latency and power for all the network components including routers and latches. Our power modeling also takes into account the additional overhead incurred due to the heterogeneous network, such as additional buffers within routers.

3. Wire Implementations

We begin with a quick review of factors that influence wire properties. It is well-known that the delay of a wire is a function of its RC time constant (R is resistance and C is capacitance).
Resistance per unit length is (approximately) inversely proportional to the width of the wire [19]. Likewise, a fraction of the capacitance per unit length is inversely proportional to the spacing between wires, and a fraction is directly proportional to wire width. These wire properties provide an opportunity to design wires that trade off bandwidth and latency. By allocating more metal area per wire and increasing wire width and spacing, the net effect is a reduction in the RC time constant. This leads to a wire design that has favorable latency properties, but poor bandwidth properties (as fewer wires can be accommodated in a fixed metal area). In certain cases, nearly a three-fold reduction in wire latency can be achieved, at the expense of a four-fold reduction in bandwidth. Further, researchers are actively pursuing transmission line implementations that enable extremely low communication latencies [12, 16]. However, transmission lines also entail significant metal area overheads in addition to logic overheads for sending and receiving [8, 12]. If transmission line implementations become cost-effective at future technologies, they represent another attractive wire design point that can trade off bandwidth for low latency.

Similar trade-offs can be made between latency and power consumed by wires. Global wires are usually composed of multiple smaller segments that are connected with repeaters [4]. The size and spacing of repeaters influence wire delay and the power consumed by the wire. When smaller and fewer repeaters are employed, wire delay increases, but power consumption is reduced. The repeater configuration that minimizes delay is typically very different from the repeater configuration that minimizes power consumption. Banerjee et al. [6] show that at 50nm technology, a five-fold reduction in power can be achieved at the expense of a two-fold increase in latency.

Thus, by varying properties such as wire width/spacing and repeater size/spacing, we can implement wires with different latency, bandwidth, and power properties. Consider a CMOS process where global inter-core wires are routed on the 8X and 4X metal planes. Note that the primary differences between minimum-width wires in the 8X and 4X planes are their width, height, and spacing. We will refer to these minimum-width wires as baseline B-Wires (either 8X-B-Wires or 4X-B-Wires). In addition to these wires, we will design two more wire types that may be potentially beneficial (summarized in Figure 1). A low-latency L-Wire can be designed by increasing the width and spacing of the wire on the 8X plane (by a factor of four). A power-efficient PW-Wire is designed by decreasing the number and size of repeaters within minimum-width wires on the
4X plane.

[Figure 1. Examples of different wire implementations: L (delay-optimized, low bandwidth), B-8X (baseline, low latency), B-4X (baseline, high bandwidth), and PW (power-optimized, high bandwidth, high delay). Power-optimized wires have fewer and smaller repeaters, while bandwidth-optimized wires have narrow widths and spacing.]

While a traditional architecture would employ the entire available metal area for B-Wires (either 4X or 8X), we propose the design of a heterogeneous interconnect, where part of the available metal area is employed for B-Wires, part for L-Wires, and part for PW-Wires. Thus, any data transfer has the option of using one of three sets of wires to effect the communication. A typical composition of a heterogeneous interconnect may be as follows: 256 B-Wires,
512 PW-Wires, 24 L-Wires. In the next section, we will demonstrate how these options can be exploited to improve performance and reduce power consumption. We will also examine the complexity introduced by a heterogeneous interconnect.

4. Optimizing Coherence Traffic

For each cache coherence protocol, there exist a variety of coherence operations with different bandwidth and latency needs. Because of this diversity, there are many opportunities to improve performance and power characteristics by employing a heterogeneous interconnect. The goal of this section is to present a comprehensive listing of such opportunities. We focus on protocol-specific optimizations in Section 4.1 and on protocol-independent techniques in Section 4.2. We discuss the implementation complexity of these techniques in Section 4.3.

4.1 Protocol-Dependent Techniques

We first examine the characteristics of operations in both directory-based and snooping bus-based coherence protocols and how they can map to different sets of wires. In a bus-based protocol, the ability of a cache to directly respond to another cache's request leads to low L1 cache-to-cache miss latencies. L2 cache latencies are relatively high as a processor core has to acquire the bus before sending a request to L2. It is difficult to support a large number of processor cores with a single bus due to the bandwidth and electrical limits of a centralized bus [11]. In a directory-based design [13, 28], each L1 connects to the L2 cache through a point-to-point link. This design has low L2 hit latency and scales better. However, each L1 cache-to-cache miss must be forwarded by the L2 cache, which implies high L1 cache-to-cache latencies. The performance comparison between these two design choices depends on the cache sizes, miss rates, number of outstanding memory requests, working-set sizes, sharing behavior of targeted benchmarks, etc. Since either option may be attractive to chip manufacturers, we will consider both forms of coherence protocols in our study.

Write-Invalidate Directory-Based Protocol

Write-invalidate directory-based protocols have been implemented in existing dual-core CMPs [41] and will likely be used in larger scale CMPs as well. In a directory-based protocol, every cache line has a directory where the states of the block in all L1s are stored. Whenever a request misses in an L1 cache, a coherence message is sent to the directory at the L2 to check the cache line's global state. If there is a clean copy in the L2 and the request is a READ, it is served by the L2 cache. Otherwise, another L1 must hold an exclusive copy and the READ request is forwarded to the exclusive owner, which supplies the data. For a WRITE request, if any other L1 caches hold a copy of the cache line, coherence messages are sent to each of them requesting that they invalidate their copies. The requesting L1 cache acquires the block in exclusive state only after all invalidation messages have been acknowledged.

Hop imbalance is quite common in a directory-based protocol. To exploit this imbalance, we can send critical messages on fast wires to increase performance and send non-critical messages on slow wires to save power. For the sake of this discussion, we assume that the hop latencies of different wires are in the following ratio: L-wire : B-wire : PW-wire :: [ratio values missing].

Proposal I: Read exclusive request for block in shared state

In this case, the L2 cache's copy is clean, so it provides the data to the requesting L1 and invalidates all shared copies. When the requesting L1 receives the reply message from the L2, it collects invalidation acknowledgment
messages from the other L1s before returning the data to the processor core. Figure 2 depicts all generated messages. The reply message from the L2 requires only one hop, while the invalidation process requires two hops, an example of hop imbalance.

(Footnote: Some coherence protocols may not impose all of these constraints, thereby deviating from a sequentially consistent memory model.)

[Figure 2. Read exclusive request for a shared block in the MESI protocol. Components shown: Processor 1, Cache 1, Processor 2, Cache 2, and the L2 cache and directory. Processor 1 attempts a write and sends Rd-Exc to the directory. The directory finds the block in Shared state in Cache 2 and sends a clean copy of the cache block to Cache 1. The directory sends an invalidate message to Cache 2. Cache 2 sends an invalidate acknowledgement back to Cache 1.]

Since there is no benefit to receiving the cache line early, latencies for each hop can be chosen so as to equalize communication latency for the cache line and the acknowledgment messages. Acknowledgment messages include identifiers so they can be matched against the outstanding request in the L1's MSHR. Since there are only a few outstanding requests in the system, the identifier requires few
bits, allowing the acknowledgment to be transferred on a few low-latency L-Wires. Simultaneously, the data block transmission from the L2 can happen on low-power PW-Wires and still finish before the arrival of the acknowledgments. This strategy improves performance (because acknowledgments are often on the critical path) and reduces power consumption (because the data block is now transferred on power-efficient wires). While circuit designers have frequently employed different types of wires within a circuit to reduce power dissipation without extending the critical path, the proposals in this paper represent some of the first attempts to exploit wire properties at the architectural level.

Proposal II: Read request for block in exclusive state

In this case, the value in the L2 is likely to be stale and the following protocol actions are taken. The L2 cache sends a speculative data reply to the requesting L1 and forwards the read request as an intervention message to the exclusive owner. If the cache copy in the exclusive owner is clean, an acknowledgment message is sent to the requesting L1, indicating that the speculative data reply from the L2 is valid. If the cache copy is dirty, a response message with the latest data is sent to the requesting L1 and a write-back message is sent to the L2. Since the requesting L1 cannot proceed until it receives a message from the exclusive owner, the speculative data reply from the L2 (a single hop transfer) can be sent on slower PW-Wires. The forwarded request to the exclusive owner is on the critical path, but includes the block address. It is therefore not eligible for transfer on low-bandwidth L-Wires. If the owner's copy is in the exclusive clean state, a short acknowledgment message to the requestor can be sent on L-Wires. If the owner's copy is dirty, the cache block can be sent over B-Wires, while the low priority writeback to the L2 can happen on PW-Wires. With the above mapping, we accelerate the critical path by using faster L-Wires, while also lowering power consumption by sending non-critical data on PW-Wires. The above protocol actions apply even in the case when a read-exclusive request is made for a block in the exclusive state.

Proposal III: NACK messages

When the directory state is busy, incoming requests are often NACKed by the home directory, i.e., a negative acknowledgment is sent to the requester rather than
buffering the request. Typically the requesting cache controller re-issues the request and the request is serialized in the order in which it is actually accepted by the directory. A NACK message can be matched by comparing the request id (MSHR index) rather than the full address, so a NACK is eligible for transfer on low-bandwidth L-Wires. If load at the home directory is low, it will likely be able to serve the request when it arrives again, in which case sending the NACK on fast L-Wires can improve performance. In contrast, when load is high, frequent backoff-and-retry cycles are experienced. Here, fast NACKs only increase traffic levels without providing any performance benefit, so NACKs can be sent on PW-Wires to save power.

Proposal IV: Unblock and write control messages

Some protocols [36] employ unblock and write control messages to reduce implementation complexity. For every read transaction, a processor first sends a request message that changes the L2 cache state into a transient state. After receiving the data reply, it sends an unblock message to change the L2 cache state back to a stable state. Similarly, write control messages are used to implement a 3-phase writeback transaction. A processor first sends a control message to the directory to order the writeback message with other request messages. After receiving the writeback response from the directory, the processor sends the data. This avoids a race condition in which the processor sends the writeback data while a request is being forwarded to it. Sending unblock messages on L-Wires can improve performance by reducing the time cache lines are in busy states. Write control messages (writeback request and writeback grant) are not on the critical path, although they are also eligible for transfer on L-Wires. The choice of sending writeback control messages on L-Wires or PW-Wires represents a power-performance trade-off.

Write-Invalidate Bus-Based Protocol

We next examine techniques that apply to bus-based snooping protocols.

Proposal V: Signal wires

In a bus-based system, three wired-OR signals are typically employed to avoid involving the lower/slower memory hierarchy [15]. Two of these signals are responsible for reporting the state of snoop results and the third indicates that the
snoop result is valid. The first signal is asserted when any L1 cache, besides the requester, has a copy of the block. The second signal is asserted if any cache has the block in exclusive state. The third signal is an inhibit signal, asserted until all caches have completed their snoop operations. When the third signal is asserted, the requesting L1 and the L2 can safely examine the other two signals. Since all of these signals are on the critical path, implementing them using low-latency L-Wires can improve performance.

Proposal VI: Voting wires

Another design choice is whether to use cache-to-cache transfers if the data is in the shared state in a cache. The Silicon Graphics Challenge [17] and the Sun Enterprise use cache-to-cache transfers only for data in the modified state, in which case there is a single supplier. On the other hand, in the full Illinois MESI protocol, a block can be preferentially retrieved from another cache rather than from memory. However, when multiple caches share a copy, a "voting" mechanism is required to decide which cache will supply the data, and this voting mechanism can benefit from the use of low latency wires.

4.2 Protocol-Independent Techniques

Proposal VII: Narrow Bit-Width Operands for Synchronization Variables

Synchronization is one of the most important factors in the performance of a parallel application. Synchronization is not only often on the critical path, but it also contributes a large percentage (up to 40%) of coherence misses [30]. Locks and barriers are the two most widely used synchronization constructs. Both of them use small integers to implement mutual exclusion. Locks often toggle the synchronization variable between zero and one, while barriers often linearly increase a barrier variable from zero to the number of processors taking part in the barrier operation. Such data transfers have limited bandwidth needs and can benefit from using L-Wires.

This optimization can be further extended by examining the general problem of cache line compaction. For example, if a cache line is comprised mostly of 0 bits, trivial data compaction algorithms may reduce the bandwidth needs of the cache line, allowing it to be transferred on L-Wires instead of B-Wires. If the wire latency difference between the two wire implementations is greater than the delay of the compaction/de-compaction algorithm, performance improvements are possible.

Proposal VIII: Assigning Writeback Data to PW-Wires

Writeback data transfers result from cache replacements or external request/intervention messages. Since writeback messages are rarely on the critical path, assigning them to PW-Wires can save power without incurring significant performance penalties.

Proposal IX: Assigning Narrow Messages to L-Wires

Coherence messages that include the data block address or the data block itself are many bytes wide. However, many other messages, such as acknowledgments and NACKs, do not include the address or data block and only contain control information
(source/destination, message type, MSHR id, etc.). Such narrow messages can always be assigned to low latency L-Wires to accelerate the critical path.

4.3 Implementation Complexity

4.3.1 Overhead in Heterogeneous Interconnect Implementation

In a conventional multiprocessor interconnect, a subset of wires are employed for addresses, a subset for data, and a subset for control signals. Every bit of communication is mapped to a unique wire. When employing a heterogeneous interconnect, a communication bit can map to multiple wires. For example, data returned by the L2 in response to a read-exclusive request may map to B-Wires or PW-Wires depending on whether there are other sharers for that block (Proposal I). Thus, every wire must be associated with a multiplexor and de-multiplexor.

The entire network operates at the same fixed clock frequency, which means that the number of latches within every link is a function of the link latency. Therefore, PW-Wires have to employ additional latches, relative to the baseline B-Wires. Dynamic power per latch at 5GHz and 65nm technology is calculated to be 0.1mW, while leakage power per latch equals 19.8µW [25]. The power per unit length for each wire is computed in the next section. Power overheads due to these latches for different wires are tabulated in Table 1. Latches impose a 2% overhead within B-Wires, but a 13% overhead within PW-Wires.

The proposed model also introduces additional complexity in the routing logic. The base case router employs a cross-bar switch and 8-entry message buffers at each input port. Whenever a message arrives, it is stored in the input buffer and routed to an allocator that locates the output port and transfers the message. In case of a heterogeneous model, three different buffers are required at each port to store L, B, and PW messages separately. In our simulations we employ three 4-entry message buffers for each port. The size of each buffer is proportional to the flit size of the corresponding set of wires. For example, a set of 24 L-Wires employs a 4-entry message buffer with a word size of 24 bits. For power calculations we have also included the fixed additional overhead associated with these small buffers as opposed to a single larger buffer employed in the base case.

In our proposed processor model, the dynamic characterization of messages happens only in the processors and intermediate network routers cannot re-assign a message to a different set of wires. While this may have a negative effect on performance in a highly utilized network, we chose to keep the routers simple and not implement such a feature.
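
The wire assignments proposed in Sections 4.1 and 4.2 for the directory-based protocol can be summarized as a small selection function evaluated at the sender. The following is a minimal sketch, not the authors' implementation: the names `Wire`, `Msg`, `Context`, and `select_wires`, the message categories, and the boolean congestion flag are all hypothetical, chosen only to illustrate the mapping described in Proposals I, II, III, IV, VIII, and IX.

```python
from dataclasses import dataclass
from enum import Enum


class Wire(Enum):
    L = "L-Wires"    # low latency, low bandwidth
    B = "B-Wires"    # baseline
    PW = "PW-Wires"  # power-efficient, higher latency


class Msg(Enum):
    # Hypothetical message categories, named after the cases in Section 4.
    ACK = "acknowledgment (control only)"
    NACK = "negative acknowledgment"
    UNBLOCK = "unblock"
    DATA_REPLY = "data reply from L2"
    SPEC_DATA_REPLY = "speculative data reply from L2"
    WRITEBACK_DATA = "writeback data"
    FWD_REQUEST = "forwarded request (carries block address)"
    OWNER_DATA = "dirty block supplied by exclusive owner"


@dataclass
class Context:
    other_sharers: bool = False  # directory state shows other L1 sharers
    congested: bool = False      # assumed congestion indicator at the sender


def select_wires(msg: Msg, ctx: Context) -> Wire:
    """Pick a wire set for a message, following the proposals in the text."""
    if msg is Msg.NACK:
        # Proposal III: fast NACKs help only when the directory load is low.
        return Wire.PW if ctx.congested else Wire.L
    if msg in (Msg.ACK, Msg.UNBLOCK):
        # Proposals I, II, IV, IX: narrow messages on the critical path.
        return Wire.L
    if msg is Msg.DATA_REPLY:
        # Proposal I: with other sharers, the data would wait for the
        # acknowledgments anyway, so it can travel on PW-Wires.
        return Wire.PW if ctx.other_sharers else Wire.B
    if msg in (Msg.SPEC_DATA_REPLY, Msg.WRITEBACK_DATA):
        # Proposals II and VIII: not on the critical path.
        return Wire.PW
    # Address-carrying requests and owner-supplied data stay on B-Wires.
    return Wire.B
```

For instance, `select_wires(Msg.DATA_REPLY, Context(other_sharers=True))` selects the PW-Wires, while the matching acknowledgments go out on L-Wires. Because the choice is made once at the sender, this is consistent with the decision above to keep intermediate routers from re-assigning messages.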
Wire Type             Power/Length   Latch Power   Latch Spacing   Total Power/10mm
                      (mW/mm)        (mW/latch)    (mm)            (mW/10mm)
B-Wire, 8X plane      1.4221         0.119         5.15            14.46
B-Wire, 4X plane      1.5928         0.119         3.4             16.29
L-Wire, 8X plane      0.7860         0.119         9.8             7.80
PW-Wire, 4X plane     0.4778         0.119         1.7             5.48

Table 1. Power characteristics of different wire implementations. For calculating the power/length, the activity factor (described in Table 3) is assumed to be 0.15. The above latch spacing values are for a 5GHz network.

For a network employing virtual channel flow control, each set of wires in the heterogeneous network link is treated as a separate physical channel and the same number of virtual channels are maintained per physical channel. Therefore, the heterogeneous network has a larger total number of virtual channels and the routers require more state fields to keep track of these additional virtual channels. To summarize, the additional overhead introduced by the heterogeneous
model comes in the form of potentially more latches and greater routing complexity.

4.3.2 Overhead in Decision Process

The decision process in selecting the right set of wires is minimal. For example, in Proposal I, an OR function on the directory state for that block is enough to select either B- or PW-Wires. In Proposal II, the decision process involves a check to determine if the block is in the exclusive state. To support Proposal III, we need a mechanism that tracks the level of congestion in the network (for example, the number of buffered outstanding messages). There is no decision process involved for Proposals IV, V, VI, and VIII. Proposals VII and IX require logic to compute the width of an operand, similar to logic used in the PowerPC 603 [18] to determine the latency for integer multiply.

4.3.3 Overhead in Cache Coherence Protocols

Most coherence protocols are already designed to be robust in the face of variable delays for different messages. For protocols relying on message order within a virtual channel, each virtual channel can be made to consist of a set of L-, B-, and PW-message buffers. A multiplexor can be used to activate only one type of message buffer at a time to ensure correctness. For other protocols that are designed to handle message re-ordering within a virtual channel, we propose to employ one dedicated virtual channel for each set of wires to fully exploit the benefits of a heterogeneous interconnect. In all proposed innovations, a data packet is not distributed across different sets of wires. Therefore, different components of an entity do not arrive at different periods of time, thereby eliminating any timing problems. It may be worth considering sending the critical word of a cache line on L-Wires and the rest of the cache line on PW-Wires. Such a
proposal may entail non-tri vial comple xity to handle corner cases and is not discussed further in this paper In snooping us-based coherence protocol, transactions are serialized by the order in which addresses appear on the us. None of our proposed inno ations for snooping pro- tocols af fect the transmission of address bits (address bits are al ays transmitted on B-W ires), so the transaction seri- alization model is preserv ed. 5. Results 5.1 Methodology 5.1.1 Simulator simulate 16-core CMP with the irtutech Simics full- system functional ecution-dri en simulator [33 and timing

infrastructure GEMS [34]. GEMS can simulate both in-order and out-of-order processors. In most studies, we use the in-order blocking processor model provided by Simics to drive the detailed memory model (Ruby) for fast simulation. Ruby implements a one-level MOESI directory cache coherence protocol with migratory sharing optimization [14, 40]. All processor cores share a non-inclusive L2 cache, which is organized as a non-uniform cache architecture (NUCA) [22]. Ruby can also be driven by an out-of-order processor module called Opal, and we report the impact of the processor cores on the heterogeneous interconnect in Section 5.3. Opal is a timing-first simulator that implements the performance-sensitive aspects of an out-of-order processor but ultimately relies on Simics to provide functional correctness. We configure Opal to model the processor described in Table 2 and use an aggressive implementation of sequential consistency.

To test our ideas, we employ a workload consisting of all programs from the SPLASH-2 [43] benchmark suite. The programs were run to completion, but all experimental results reported in this paper are for the parallel phases of these applications. We use default input sets for most programs except fft and radix. Since the default working sets of these two programs are too small, we increase the working set of fft to 1M data points and that of radix to 4M keys.

5.1.2 Interconnect Power/Delay/Area Models

This section describes details of the interconnect architecture and the methodology we employ for calculating the area, delay, and power values of the interconnect. We consider a 65nm process technology and assume 10 metal layers, spread over the 1X plane and the 2X, 4X, and 8X planes [25]. For most of our study, we employ a crossbar-based hierarchical
interconnect structure to connect the cores and L2 cache (Figure 3(a)), similar to that in SGI NUMALink-4 [1]. The effect of other interconnect topologies is discussed in our sensitivity analysis. In the base case, each link in Figure 3(a) consists of (in each direction) 64-bit address wires, 64-byte data wires, and 24-bit control wires. The control signals carry source, destination, signal type, and Miss Status Holding Register (MSHR) id. All wires are fully pipelined. Thus, each link in the interconnect is capable of transferring 75 bytes in each direction. Error Correction Codes (ECC) account for another 13% overhead in addition to the above-mentioned wires [38]. All the wires of the base case are routed as B-Wires in the 8X plane.

Parameter                   Value
number of cores             16
clock frequency             5GHz
pipeline width              4-wide fetch and issue
pipeline stages             11
cache block size            64 Bytes
split L1 cache              128KB, 4-way
shared L2 cache             8MBytes, 4-way, 16-banks, non-inclusive NUCA
memory/dir controllers      30 cycles
interconnect link latency   [?] cycles (one-way) for the baseline 8X-B-Wires
DRAM latency                400 cycles
memory bank capacity        [?] GByte per bank
latency to mem controller   100 cycles

Table 2. System configuration.

Figure 3. Interconnect model used for coherence transactions in a sixteen-core CMP. (a) Hierarchical network topology for 16-core CMP. (b) Links with different sets of wires (B-Wire, L-Wire, PW-Wire).

As shown in Figure 3(b), the proposed heterogeneous model employs additional wire types within each link. In addition to B-Wires, each link includes low-latency, low-bandwidth L-Wires and high-bandwidth, high-latency, power-efficient PW-Wires. The number of L- and PW-Wires that can be employed is a function of the available metal area and the needs of the coherence protocol. In order to match the metal area with the baseline, each unidirectional link within the heterogeneous model is designed to be made up of 24 L-Wires, 512 PW-Wires, and 256 B-Wires (the base case has 600 B-Wires, not counting ECC). In a cycle, three messages may be sent, one on each of the three sets of wires. The bandwidth, delay, and power calculations for these wires are discussed subsequently. Table 3 summarizes the different types of wires and their area, delay, and power characteristics.
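The wire counts above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' tooling: it counts wires per unidirectional link and converts each configuration into 8X-B-Wire-equivalent metal area. The 4x area factor for L-Wires comes from the text; the 0.5x factor for PW-Wires is an assumption (wires on the 4X plane taken to need half the pitch of 8X wires), and all function and variable names are hypothetical.

```python
# Relative metal area per wire, in units of one minimum-width 8X-B-Wire.
# L = 4.0 is stated in the text; PW = 0.5 is an ASSUMPTION (4X-plane pitch
# taken as half the 8X-plane pitch).
REL_AREA = {"B": 1.0, "L": 4.0, "PW": 0.5}


def base_link_wires(addr_bits=64, data_bytes=64, ctrl_bits=24):
    """Wires per direction in the baseline link (ECC not counted)."""
    return addr_bits + 8 * data_bytes + ctrl_bits


def metal_area(wires):
    """Total area of a link, expressed in 8X-B-Wire equivalents."""
    return sum(REL_AREA[kind] * count for kind, count in wires.items())


baseline = {"B": base_link_wires()}
hetero = {"L": 24, "PW": 512, "B": 256}

# Narrow-link configurations used later in the bandwidth sensitivity study.
narrow_base = {"B": 80}
narrow_hetero = {"L": 24, "B": 24, "PW": 48}

print(base_link_wires(), base_link_wires() // 8)              # 600 wires, 75 bytes
print(metal_area(baseline), metal_area(hetero))               # 600.0 608.0
print(metal_area(narrow_hetero) / metal_area(narrow_base))    # 1.8
```

Under these assumptions the heterogeneous link occupies roughly the same area as the 600-wire baseline, and the narrow heterogeneous link comes out at about 1.8 times the 80-wire base case, in line with the "almost twice the metal area" remark in Section 5.3.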

The area overhead of the interconnect can be mainly attributed to repeaters and wires. We use wire width and spacing (based on ITRS projections) to calculate the effective area for minimum-width wires in the 4X and 8X plane. L-Wires are designed to occupy four times the area of minimum-width 8X-B-Wires.

Delay

Our wire model is based on the RC models proposed in [6, 19, 37]. The delay per unit length of a wire with optimally placed repeaters is given by equation (1), where $R_{wire}$ is resistance per unit length of the wire, $C_{wire}$ is capacitance per unit length of the wire, and $FO1$ is the fan-out of one delay:

$$\mathrm{Latency}_{wire} = 2.13\sqrt{R_{wire}\,C_{wire}\,FO1} \qquad (1)$$

$R_{wire}$ is inversely proportional to wire width, while $C_{wire}$ depends on the following three components: (i) fringing capacitance that accounts for the capacitance between the side wall of the wire and substrate, (ii) parallel plate capacitance between the top and bottom layers of the metal that is directly proportional to the width of the metal, (iii) parallel plate capacitance between the adjacent metal wires that is inversely proportional to the spacing between the wires. The $C_{wire}$ value for the top most metal layer at 65nm technology is given by equation (2) [37], where $W$ is the wire width and $S$ is the spacing:

$$C_{wire} = 0.065 + 0.057\,W + 0.015/S \quad (\mathrm{fF}/\mu\mathrm{m}) \qquad (2)$$

We derive relative delays for different types of wires by tuning width and spacing in the above equations. A variety of width and spacing values can allow L-Wires to yield a two-fold latency improvement at a four-fold area cost, relative to 8X-B-Wires.

Wire Type             Relative Latency   Relative Area             Dynamic Power (W/m)         Static Power (W/m)
                                         (wire width + spacing)    (α = switching factor)
B-Wire (8X plane)     1x                 1x                        [?]65α                      1.0246
B-Wire (4X plane)     [?]                [?]                       [?]                         1.1578
L-Wire (8X plane)     0.5x               4x                        [?]46α                      0.5670
PW-Wire (4X plane)    [?]                [?]                       [?]87α                      0.3074

Table 3. Area, delay, and power characteristics of different wire implementations.
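Equations (1) and (2) can be turned into a small calculator for exploring width/spacing trade-offs. This is a sketch under stated assumptions rather than the authors' model: the coefficients are those read from Equation (2), widths and spacings are assumed to be in micrometres, resistance is taken as proportional to 1/width, and the minimum dimensions `MIN_W` and `MIN_S` are hypothetical placeholders because the actual 8X-plane values are not given here. The L-Wire geometry (twice the minimum width, six times the minimum spacing) is the one the text selects.

```python
import math


def wire_capacitance(width_um, spacing_um):
    """Equation (2): capacitance per unit length (assumed fF/um) of a top-layer wire."""
    return 0.065 + 0.057 * width_um + 0.015 / spacing_um


def relative_delay(width_um, spacing_um, ref_width_um, ref_spacing_um):
    """Delay per unit length relative to a reference wire, using Equation (1).

    The 2.13 constant and FO1 cancel in the ratio, and R is modelled as
    proportional to 1/width, so only the R*C product matters.
    """
    rc = wire_capacitance(width_um, spacing_um) / width_um
    rc_ref = wire_capacitance(ref_width_um, ref_spacing_um) / ref_width_um
    return math.sqrt(rc / rc_ref)


def relative_area(width, spacing, ref_width, ref_spacing):
    """Metal area per wire scales with its pitch (width + spacing)."""
    return (width + spacing) / (ref_width + ref_spacing)


# HYPOTHETICAL minimum width and spacing for the 8X plane, in micrometres.
MIN_W = MIN_S = 1.0

# L-Wire geometry from the text: 2x minimum width, 6x minimum spacing.
print(relative_area(2 * MIN_W, 6 * MIN_S, MIN_W, MIN_S))   # 4.0
print(relative_delay(2 * MIN_W, 6 * MIN_S, MIN_W, MIN_S))  # below 1.0: a faster wire
```

The pitch ratio reproduces the four-fold area cost directly. The delay ratio depends on the real minimum dimensions, so the placeholder values only show the direction of the effect, not the two-fold improvement reported in the paper.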

In order to reduce power consumption, we selected a wire implementation where the L-Wire width was twice that of the minimum width and the spacing was six times as much as the minimum spacing for the 8X metal plane.

Power

The total power consumed by a wire is the sum of three components (dynamic, leakage, and short-circuit power). Equations derived by Banerjee and Mehrotra [6] are used to derive the power consumed by L- and B-Wires. These equations take into account optimal repeater size/spacing and wire width/spacing. PW-Wires are designed to have twice the delay of 4X-B-Wires. At 65nm technology, for a delay penalty of 100%, smaller and widely-spaced repeaters enable power reduction by 70% [6].

Routers

Crossbars, buffers, and arbiters are the major contributors for router power [42]:

$$E_{router} = E_{buffer} + E_{crossbar} + E_{arbiter} \qquad (3)$$

The capacitance and energy for each of these components is based on analytical models proposed by Wang et al. [42]. We model a 5x5 matrix crossbar that employs a tristate buffer connector. As described in Section 4.3, buffers are modeled for each set of wires with word size corresponding to flit size. Table 4 shows the peak energy consumed by each component of the router for a single 32-byte transaction.

Component                 Energy/transaction (J)
Arbiter                   6.43079e-14
Crossbar                  5.32285e-12
Buffer read operation     1.23757e-12
Buffer write operation    1.73723e-12

Table 4. Energy consumed (max) by arbiters, buffers, and crossbars for a 32-byte transfer.

5.2 Results

For our simulations, we restrict ourselves to directory-based protocols. We model the effect of proposals pertaining to such a protocol: I, III, IV, VIII, IX. Proposal-II optimizes speculative reply messages in MESI protocols, which are not implemented within GEMS' MOESI protocol. Evaluations involving compaction of cache blocks (Proposal VII) is left as
future work.

Figure 4. Speedup of heterogeneous interconnect.

Figure 4 shows the execution time in cycles for SPLASH2 programs. The first bar shows the performance of the baseline organization that has one interconnect layer of 75 bytes, composed entirely of 8X-B-Wires. The second shows the performance of the heterogeneous interconnect model in which each link consists of 24-bit L-Wires, 32-byte B-Wires, and 64-byte PW-Wires. Programs such as LU-Non-continuous, Ocean-Non-continuous, and Raytracing yield significant improvements in performance.

These performance numbers can be analyzed with the help of Figure 5, which shows the distribution of different transfers that happen on the interconnect. Transfers on L-Wires can have a huge impact on performance, provided they are on the program critical path. LU-Non-continuous, Ocean-Non-continuous, Ocean-Continuous, and Raytracing experience the most transfers on L-Wires. But the performance improvement of Ocean-Continuous is very low compared to other benchmarks. This can be attributed to the fact that Ocean-Continuous incurs the most L2 cache misses and is mostly memory bound. The transfers on PW-Wires have a negligible effect on performance for all benchmarks. This is because PW-Wires are employed only for writeback transfers that are always off the critical path. On average, we observe an 11.2% improvement in performance, compared to the baseline, by employing heterogeneity within the network.

Figure 5. Distribution of messages on the heterogeneous network. B-Wire transfers are classified as Request and Data.

Proposals I, III, IV, and IX exploit L-Wires to send small messages within the protocol, and contribute 2.3, 0, 60.3, and 37.4 percent, respectively, to total L-Wire traffic. A per-benchmark breakdown is shown in Figure 6. Proposal-I optimizes the case of a read exclusive request for a block in shared state, which is not very common in the SPLASH2 benchmarks. We expect the impact of Proposal-I to be much higher in commercial workloads where cache-to-cache misses dominate. Proposal-III and Proposal-IV impact NACK, unblocking, and writecontrol messages. These messages are widely used to reduce the implementation complexity of coherence protocols. In GEMS' MOESI protocol, NACK messages are only used to handle the race condition between two write-back messages, which are negligible in our study (causing the zero contribution of Proposal-III). Instead, the protocol implementation heavily relies on unblocking and writecontrol messages to maintain the order between read and write transactions, as discussed in Section 4.1. The frequency of occurrence of NACK, unblocking, and writecontrol messages depends on the protocol implementation, but we expect the sum of these messages to be relatively constant in different protocols and play an important role in L-Wire optimizations. Proposal-IX includes all other acknowledgment messages eligible for transfer on L-Wires.
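The message-to-wire mapping discussed above amounts to a small classification step. The sketch below is illustrative only and is not the logic of any particular proposal: the message-type names are hypothetical, and the rule set is reduced to what this section states (narrow control messages go on L-Wires, writeback data goes on PW-Wires, everything else stays on B-Wires). The 24-bit limit matches the number of L-Wires per link in the evaluated configuration.

```python
from enum import Enum


class Wire(Enum):
    L = "L-Wires"    # low latency, low bandwidth
    B = "B-Wires"    # baseline wires
    PW = "PW-Wires"  # power-efficient, high latency


L_WIRE_BITS = 24  # L-Wires per unidirectional link in the evaluated design

# HYPOTHETICAL names for the narrow control messages mentioned in the text.
NARROW_CONTROL = {"ack", "nack", "unblock", "write_control"}


def select_wires(msg_type, size_bits):
    """Pick a wire set for one coherence message (simplified illustration)."""
    if msg_type == "writeback_data":
        # Writebacks are off the critical path, so latency can be traded for power.
        return Wire.PW
    if msg_type in NARROW_CONTROL and size_bits <= L_WIRE_BITS:
        # Small control messages fit on the narrow low-latency wires.
        return Wire.L
    return Wire.B


print(select_wires("unblock", 24))          # Wire.L
print(select_wires("writeback_data", 512))  # Wire.PW
print(select_wires("data_reply", 512))      # Wire.B
```

A table lookup of this kind is consistent with the paper's later remark that mapping messages to wires entails simple logic.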

Figure 6. Distribution of L-message transfers across different proposals.

We observed that the combination of proposals I, III, IV, and IX caused a performance improvement more than the sum of improvements from each individual proposal. A parallel benchmark can be divided into a number of phases by synchronization variables (barriers), and the execution time of each phase can be defined as the longest time any thread spends from one barrier to the next. Optimizations applied to a single thread may have no effect if there are other threads on the critical path. However, a different optimization may apply to the threads on the critical path, reduce their execution time, and expose the performance of other threads and the optimizations that apply to them. Since different threads take different data paths, most parallel applications show nontrivial workload imbalance [31]. Therefore, employing one proposal might not speed up all threads on the critical path, but employing all applicable proposals can probably optimize threads on every path, thereby reducing the total barrier-to-barrier time.
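The barrier argument can be made concrete with a toy model. In this sketch the per-thread cycle counts and the savings attributed to each optimization are invented purely for illustration; the only idea taken from the text is that a phase lasts as long as its slowest thread.

```python
def phase_time(thread_times):
    """A barrier-delimited phase lasts as long as its slowest thread."""
    return max(thread_times)


def apply_savings(times, savings):
    """Subtract per-thread cycle savings from per-thread execution times."""
    return [t - s for t, s in zip(times, savings)]


# HYPOTHETICAL per-thread cycle counts for one phase; threads 0 and 1 are both critical.
baseline = [100, 100, 60]

# HYPOTHETICAL savings: optimization A only helps thread 0, B only helps thread 1.
opt_a = [20, 0, 0]
opt_b = [0, 20, 0]
both = [a + b for a, b in zip(opt_a, opt_b)]

gain_a = phase_time(baseline) - phase_time(apply_savings(baseline, opt_a))
gain_b = phase_time(baseline) - phase_time(apply_savings(baseline, opt_b))
gain_both = phase_time(baseline) - phase_time(apply_savings(baseline, both))

print(gain_a, gain_b, gain_both)  # 0 0 20
```

Each optimization alone leaves the phase time unchanged because the other critical thread still reaches the barrier last, while the combination shortens the phase. This is one way the combined improvement can exceed the sum of the individual ones.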

Figure 7 shows the improvement in network energy due to the heterogeneous interconnect model. The first bar shows the reduction in network energy and the second bar shows the improvement in the overall processor Energy x Delay^2 ($ED^2$) metric. Other metrics in the E-D space can also be computed with data in Figures 7 and 4. To calculate $ED^2$, we assume that the total power consumption of the chip is 200W, of which the network power accounts for 60W. The energy improvement in the heterogeneous case comes from both L and PW transfers. Many control messages that are sent on B-Wires in the base case are sent on L-Wires in the heterogeneous case. As per Table 3, the energy consumed by an L-Wire is less than the energy consumed by a B-Wire. But due to the small sizes of these messages, the contribution of L-messages to the total energy savings is negligible. Overall, the heterogeneous network results in a 22% saving in network energy and a 30% improvement in $ED^2$.

5.3 Sensitivity Analysis

In this sub-section, we discuss the impact of processor cores, link bandwidth, routing algorithm, and network topology on the heterogeneous interconnect.

Out-of-order/In-order Processors

To test our ideas with an out-of-order processor, we configure Opal to model the processor described in Table 2 and only report the results of the first 100M instructions in the
parallel sections. (Simulating the entire program takes nearly a week, and there exist no effective toolkits to find the representative phases for parallel benchmarks.) Figure 8 shows the performance speedup of the heterogeneous interconnect over the baseline. All benchmarks except Ocean-Noncontinuous demonstrate different degrees of performance improvement, which leads to an average speedup of 9.3%. (LU-Noncontinuous and Radix were not compatible with the Opal timing module.) The average performance improvement is less than what we observe in a system employing in-order cores (11.2%). This can be attributed to the greater tolerance that an out-of-order processor has to long instruction latencies.

Figure 7. Improvement in link energy and $ED^2$.

Figure 8. Speedup of heterogeneous interconnect when driven by OoO cores (Opal and Ruby).

Link Bandwidth

The heterogeneous network poses more constraints on the type of messages that can be issued by a processor in a cycle. It is therefore likely to not perform very well in a bandwidth-constrained system. To verify this, we modeled a base case where every link has only 80 8X-B-Wires and a heterogeneous case where every link is composed of 24 L-Wires, 24 8X-B-Wires, and 48 PW-Wires (almost twice the metal area of the new base case). Benchmarks with higher network utilizations suffered significant performance losses. In our experiments, raytracing has the maximum messages/cycle ratio and the heterogeneous case suffered a 27% performance loss, compared to the base case (in spite of having twice the metal area). The heterogeneous interconnect performance improvement for Ocean Non-continuous and LU Non-continuous is 12% and 11%, as against 39% and 20% in the high-bandwidth simulations. Overall, the heterogeneous model performed 1.5% worse than the base case.

Routing Algorithm

Our simulations thus far have employed adaptive routing within the network. Adaptive routing alleviates the contention problem by dynamically routing messages based on the network traffic. We found that deterministic routing degraded performance by about 3% for most programs for systems with the baseline and with the heterogeneous network. Raytracing is the only benchmark that incurs a significant performance penalty of 27% for both networks.

Network Topology

Our default interconnect thus far was a two-level tree based on SGI NUMALink-4 [1]. To test the sensitivity of our results to the network topology, we also examine a 2D-torus interconnect resembling that in the Alpha 21364 [7]. As shown in Figure 9, each router connects to links that connect to neighbors in the torus, and wraparound links are employed to connect routers on the boundary. Our proposed mechanisms show much less performance benefit (1.3% on average) in the 2D torus interconnect than in the two-level tree interconnect. The main reason is that our decision process in selecting the right set of wires calculates hop imbalance at the coherence protocol level without considering the physical hops a message takes on the mapped topology. For example, in a 3-hop transaction as shown in Figure 2, the one-hop message may take the same number of physical hops as the 2-hop message. In this case, sending the 2-hop message on the L-Wires and the one-hop message on the PW-Wires will actually lower performance. This is not a first-order effect in the two-level tree interconnect, where most protocol hops take the same number of physical hops. However, the average distance between two processors in the 2D torus interconnect is 2.13 physical hops with a standard deviation of 0.92 hops. In an interconnect with such a high standard deviation, calculating hop imbalance based on protocol hops is inaccurate. For future work, we plan to develop a more accurate decision process that considers source id, destination id, and interconnect topology to dynamically compute an optimal mapping to wires.
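The hop statistics quoted above can be reproduced with a short calculation. This sketch assumes a 4x4 torus with one router per core, minimal routing over the wraparound links, and the sample (n-1) standard deviation; none of these details is spelled out in the text, but together they yield the quoted figures.

```python
from itertools import product
from statistics import mean, stdev


def torus_hops(a, b, k=4):
    """Minimal hop count between routers a and b on a k x k torus with wraparound links."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))


def hop_stats(k=4):
    """Mean and sample standard deviation of hop counts to all other routers."""
    nodes = list(product(range(k), repeat=2))
    src = nodes[0]  # every router sees the same distance distribution on a torus
    dists = [torus_hops(src, dst, k) for dst in nodes if dst != src]
    return mean(dists), stdev(dists)


avg, sd = hop_stats()
print(f"{avg:.2f} {sd:.2f}")  # 2.13 0.92
```

With only 15 possible destinations and distances ranging from 1 to 4 hops, the spread is large relative to the mean, which is why protocol-level hop counts are a poor proxy for physical distance on this topology.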
Figure 9. Results for the 2D Torus. (a) 2D Torus topology. (b) Heterogeneous interconnect speedup.

6. Conclusions and Future Work

Coherence traffic in a chip multiprocessor has diverse needs. Some messages can tolerate long latencies, while others are on the program critical path. Further, messages have varied bandwidth demands. On-chip global wires can be designed to optimize latency, bandwidth, or power. We advocate partitioning available metal area across different wire implementations and intelligently mapping data to the set of wires best suited for its communication. This paper presents numerous novel techniques that can exploit a heterogeneous interconnect to simultaneously improve performance and reduce power consumption.

Our evaluation of a subset of the proposed techniques shows that a large fraction of messages have low bandwidth needs and can be transmitted on low latency wires, thereby yielding a performance improvement of 11.2%. At the same time, a 22.5% reduction in interconnect energy is observed by transmitting non-critical data on power-efficient wires. The complexity cost is marginal as the mapping of messages to wires entails simple logic.

For future work, we plan to strengthen our decision process in calculating the hop imbalance based on the topology of the interconnect. We will also evaluate the potential of other techniques listed in this paper. There may be several other applications of heterogeneous interconnects within a CMP. For example, in the Dynamic Self Invalidation scheme proposed by Lebeck et al. [29], the self-invalidate [27, 29] messages can be effected through power-efficient PW-Wires. In a processor model implementing token coherence, the low-bandwidth token messages [35] are often on the critical path and thus can be effected on L-Wires. A recent study by Huh et al. [21] reduces the frequency of false sharing by employing incoherent data. For cache lines suffering from false sharing, only the sharing states need to be propagated and such messages are a good match for low-bandwidth L-Wires.

References

[1] SGI Altix 3000 Configuration. "http://www ers/altix/configs.html".
[2] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato. The Use of Prediction for Accelerating Upgrade Misses in CC-NUMA Multiprocessors. In Proceedings of PACT-11, 2002.
[3] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of ISCA-27, pages 248-259,
June 2000.
[4] H. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[5] R. Balasubramonian, N. Muralimanohar, K. Ramani, and V. Venkatachalapathy. Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. In Proceedings of HPCA-11, February 2005.
[6] K. Banerjee and A. Mehrotra. A Power-optimal Repeater Insertion Methodology for Global Interconnects in Nanometer Designs. IEEE Transactions on Electron Devices, 49(11):2001-2007, November 2002.
[7] P. Bannon. Alpha 21364: A Scalable Single-Chip SMP. October 1998.
[8] B. Beckmann and D. Wood. TLC: Transmission Line Caches. In Proceedings of MICRO-36, December 2003.
[9] B. Beckmann and D. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of MICRO-37, December 2004.
[10] E. E. Bilir, R. M. Dickson, Y. Hu, M. Plakal, D. J. Sorin, M. D. Hill, and D. A. Wood. Multicast Snooping: A New Coherence Method using a Multicast Address Network. SIGARCH Comput. Archit. News, pages 294-304, 1999.
[11] A. Briggs, M. Cekleov, K. Creta, M. Khare, S. Kulick, A. Kumar, L. Looi, C. Natarajan, S. Radhakrishnan, and L. Rankin. Intel 870: A Building Block for Cost-Effective, Scalable Servers. IEEE Micro, 22(2):36-47, 2002.
[12] R. Chang, N. Talwalkar, C. Yue, and S. Wong. Near Speed-of-Light Signaling Over On-Chip Electrical Interconnects. IEEE Journal of Solid-State Circuits, 38(5):834-838, May 2003.
[13] Corporate Institute of Electrical and Electronics Engineers, Inc. Staff. IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992. 1993.
[14] A. Cox and R. Fowler. Adaptive Cache Coherency for Detecting Migratory Shared Data. pages 98-108, May 1993.
[15] D. E. Culler and J. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc, 1999.
[16] W. Dally and J. Poulton. Digital System Engineering. Cambridge University Press, Cambridge, UK, 1998.
[17] M. Galles and E. Williams. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. In HICSS (1), pages 134-143, 1994.
[18] G. Gerosa and et al. A 2.2 W, 80 MHz Superscalar RISC Microprocessor. IEEE Journal of Solid-State Circuits, 29(12):1440-1454, December 1994.
[19] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, Vol. 89, No. 4, April 2001.
[20] P. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In Proceedings of HPCA-11 (Industrial Session), February 2005.
[21] J. Huh, J. Chang, D. Burger, and G. S. Sohi. Coherence Decoupling: Making Use of Incoherence. In Proceedings of ASPLOS-XI, pages 97-106, 2004.
[22] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. Keckler. A NUCA Substrate for Flexible CMP Cache Sharing. In ICS '05: Proceedings of the 19th annual international conference on Supercomputing, pages 31-40, New York, NY, USA, 2005. ACM Press.
[23] P. Kongetira. A 32-Way Multithreaded SPARC Processor. In Proceedings of Hot Chips 16, 2004.
[24] K. Krewell. UltraSPARC IV Mirrors Predecessor: Sun Builds Dualcore Chip in 130nm. Microprocessor Report, pages 1,5-6, Nov. 2003.
[25] R. Kumar, V. Zyuban, and D. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads, and Scaling. In Proceedings of the 32nd ISCA, June 2005.
[26] A.-C. Lai and B. Falsafi. Memory Sharing Predictor: The Key to a Speculative Coherent DSM. In Proceedings of ISCA-26, 1999.
[27] A.-C. Lai and B. Falsafi. Selective, Accurate, and Timely Self-Invalidation Using Last-Touch Prediction. In Proceedings of ISCA-27, pages 139-148, 2000.
[28] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of ISCA-24, pages 241-251, June 1997.
[29] A. R. Lebeck and D. A. Wood. Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors. In Proceedings of ISCA-22, pages 48-59, 1995.
[30] K. M. Lepak and M. H. Lipasti. Temporally Silent Stores. In Proceedings of ASPLOS-X, pages 30-41, 2002.
[31] J. Li, J. Martinez, and M. C. Huang. The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors. In HPCA '04: Proceedings of the 10th International Symposium on High Performance Computer Architecture, page 14, Washington, DC, USA, 2004. IEEE Computer Society.
[32] N. Magen, A. Kolodny, U. Weiser, and N. Shamir. Interconnect Power Dissipation in a Microprocessor. In Proceedings of System Level Interconnect Prediction, February 2004.
[33] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50-58, February 2002.
[34] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset. Computer Architecture News, 2005.
[35] M. M. K. Martin, M. D. Hill, and D. A. Wood. Token Coherence: Decoupling Performance and Correctness. In Proceedings of ISCA-30, 2003.
[36] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In HPCA, pages 328-339, 2005.
[37] M. L. Mui, K. Banerjee, and A. Mehrotra. A Global Interconnect Optimization Scheme for Nanometer Scale VLSI With Implications for Latency, Bandwidth, and Power Dissipation. IEEE Transactions on Electronic Devices, Vol. 51, No. 2, February 2004.
[38] S. Mukherjee, J. Emer, and S. Reinhardt. The Soft Error Problem: An Architectural Perspective. In Proceedings of HPCA-11 (Industrial Session), February 2005.
[39] N. Nelson, G. Briggs, M. Haurylau, G. Chen, H. Chen, D. Albonesi, E. Friedman, and P. Fauchet. Alleviating Thermal Constraints while Maintaining Performance Via Silicon-Based On-Chip Optical Interconnects. In Proceedings of Workshop on Unique Chips and Systems, March 2005.
[40] P. Stenstrom, M. Brorsson, and L. Sandberg. An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing. pages 109-118, May 1993.
[41] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. Technical report, IBM Server Group Whitepaper, October 2001.
[42] H. S. Wang, L. S. Peh, and S. Malik. A Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers. In IEEE Micro, Vol 24, January 2003.
[43] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of ISCA-22, pages 24-36, June 1995.