Limiting the Power Consumption of Main Memory

Bruno Diniz, Dorgival Guedes, Wagner Meira Jr.
Federal University of Minas Gerais, Brazil
{diniz,dorgival,meira}@dcc.ufmg.br

Ricardo Bianchini
Rutgers University, USA
ricardob@cs.rutgers.edu

ABSTRACT

The peak power consumption of hardware components affects their power supply, packaging, and cooling requirements. When the peak power consumption is high, the hardware components or the systems that use them can become expensive and bulky. Given that components and systems rarely (if ever) actually require peak power, it is highly desirable to limit power consumption to a less-than-peak power budget, based on which power supply, packaging, and cooling infrastructures can be more intelligently provisioned. In this paper, we study dynamic approaches for limiting the power consumption of main memories. Specifically, we propose four techniques that limit consumption by adjusting the power states of the memory devices, as a function of the load on the memory subsystem. Our simulations of applications from three benchmarks demonstrate that our techniques can consistently limit power to a pre-established budget. Two of the techniques can limit power with very low performance degradation. Our results also show that, when using these superior techniques, limiting power is at least as effective an energy-conservation approach as state-of-the-art techniques explicitly designed for performance-aware energy conservation. These latter results represent a departure from current energy management research and practice.

Categories and Subject Descriptors: B.3 [Memory structures]: Miscellaneous
General Terms: Design, Experimentation
Keywords: Main memory power and energy management, performance

This research has been supported by NSF under grant #CCR-0238182 (CAREER award), CNPq, FAPEMIG, and FINEP Brazil.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA '07, June 9-13, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00.

1. INTRODUCTION

The constant quest for higher performance and greater functionality has been producing hardware components and systems that consume significant amounts of power when fully utilized. Due to these high peak power consumptions, the power supply, packaging, and cooling infrastructures of these components and systems are becoming expensive and bulky. Modern high-end processors are an example of this trend, as they may consume in excess of 100 Watts when fully utilized, requiring expensive packaging and bulky heat sinks to prevent thermal emergencies. Another interesting example is hand-held devices, which are limited in their capacity by the bulky packaging that would be required by faster processors and larger memories. As a final and more extreme example, large data centers incur extremely high costs in provisioning power and cooling infrastructures for peak power consumption.

Given that hardware components and systems rarely (if ever) need to consume their peak power, it is cheaper to design packaging and cooling infrastructures for the "common-case" power consumption. Some processors have been designed this way, e.g. the Pentium 4. They do not limit their power consumption, but slow themselves down when temperatures increase beyond a pre-defined threshold. This approach works well, but the supply of power is provisioned to withstand the processor's peak power consumption. A different approach is to limit power consumption at all times to a less-than-peak power budget. This approach allows for tighter provisioning of the power supply infrastructure, as well as cheaper packaging and cooling. For this reason, this approach is more appropriate for scenarios where the supply of power is limited (e.g., hand-held devices), expensive to over-provision (e.g., data centers), or can be reduced due to unit failures (e.g., blade systems with multiple power supplies).

In this paper, we study techniques for limiting the power consumption of the main memory of stand-alone computers. The reason for this focus is four-fold. First, memory capacities (and associated power consumptions) have been increasing significantly to avoid accesses to lower levels of the memory hierarchy in servers and desktop systems, or to enable the execution of more sophisticated applications in hand-held devices. Second, regardless of the environment, the memory subsystem is typically highly under-utilized, in the sense that only a few devices need to be active at the same time. Third, if the power consumption of entire computers is to be limited with minimal performance degradation, we need to develop strategies to manage all subsystems intelligently, not just the processors. Fourth, since the low-power states of memory devices retain the data they store, limiting memory power can also benefit systems where the supply of power, packaging, and cooling infrastructures have already been provisioned: for a given system, it enables increases in memory size without the risk of violating the system specifications, e.g. without demanding more power than the supply can provide.

We propose and evaluate four techniques, called Knapsack, LRU-Greedy, LRU-Smooth, and LRU-Ordered, that dynamically limit
memory power consumption by adjusting the power states of the different memory devices, as a function of the load on the memory subsystem. We also propose energy- and performance-aware versions of our techniques, while studying the tradeoffs between power budgets, energy consumption, and performance.

Power State/Transition    Power (mW)    Delay
Accessing                 1167
Active                    300
Standby                   180
Nap                       30
Powerdown                 3
Active -> Standby         240           1 memory cycle
Active -> Nap             160           memory cycles
Active -> Powerdown       15            memory cycles
Standby -> Active         240           +6 ns
Nap -> Active             160           +60 ns
Powerdown -> Active       15            +6000 ns
Standby -> Nap             160           +4 ns
Nap -> Powerdown          15            ns

Table 1: RDRAM power states, consumptions, and overheads.

Our evaluation is based on detailed full-system simulations of nine applications from three types of systems: MediaBench benchmarks, representing the workloads of hand-held devices; SPEC CPU2000 benchmarks, representing the workloads of desktop systems; and a client-server benchmark, representing the workloads of server systems. Most of our study assumes RDRAM-based memory subsystems for the ability to control the power state of each memory chip independently. However, our techniques are also applicable to DDR SDRAM technologies and their module-level, multi-chip access and power control; the techniques can treat entire DDR modules as they do single RDRAM chips. To demonstrate the generality of our techniques, we also discuss some results for DDR2 SDRAM subsystems.

Our results demonstrate that the techniques can consistently limit power consumption. Our results make two main contributions:

- They demonstrate that two of the techniques, Knapsack and LRU-Ordered, can limit power with very low performance degradation.

- They show that, when using these techniques, limiting power is at least as effective for energy conservation as state-of-the-art techniques explicitly designed for performance-aware energy conservation.

The last contribution is particularly interesting in that limiting the power consumption to a less-than-peak budget is quite different than the current approaches to energy conservation. Specifically, energy conservation approaches send devices to increasingly lower power states without any constraints on the power consumption, i.e. during execution, power consumption may (and often does) reach the peak. Our work shows that limiting power is enough for significant energy savings with minimal performance degradation; sending devices to very low-power states may actually be counter-productive. Thus, our work paves the way for a new approach to energy conservation, representing a departure from current energy management research and practice.

2. LIMITING POWER CONSUMPTION

In this
section, we present techniques for limiting the po wer consumption of the main memory subsystem to less-than-peak udget. Our techniques le erage the act that modern memory de- vices ha multiple lo w-po wer states that retain the stored data. Each po wer state consumes dif ferent amount of po wer whereas the transitions between states in olv dif ferent ener gy and perfor mance erheads. As an xample, able lists the po wer states, their po wer consumptions, and transition erheads of RDRAM memory chips [22, 31], each of which can be transitioned inde- pendently Memory accesses (to cache-line-sized

memory blocks) can only occur in acti state, although the data is retained en in po werdo wn state. The idea behind our techniques is to ha the memory controller adjust the po wer state of the memory de vices so that their erall po wer consumption does not xceed the udget. assume that the udget is pre-defined by the user or the manuf acturer of the system containing the memory Ob viously the udget has to be high enough that at least one de vice can be accessed at an time. Our techniques reserv enough po wer for acti de vices to be accessed. Thus, adhering to the po wer udget means that,

when memory de vice that is not in acti state needs to be accessed, the controller may need to change the state of one or more other de vices. The main halleng is to design tec hniques to guide the memory contr oller in selecting power states, so that it can avoid xceeding the udg et while minimizing tr ansition verheads. In particular when the po wer udget is relati ely lo some applica- tions may suf fer performance de gradations; ne ertheless, it is im- portant to minimize these de gradations by intelligently selecting po wer states. Because the memory subsystem is major ener gy consumer in

se eral en vironments (e.g., serv er systems [21]), another impor tant halleng for the state-selection tec hniques is to adher to the power udg et while enabling as muc memory ener gy conserva- tion as possible without xcessive performance de gr adation. Thus, instead of eeping the po wer consumption just belo the udget, the techniques can reduce po wer consumption further as long as the resulting performance de gradation is acceptable. Again, we as- sume that the user or the manuf acturer of the system pre-defines maximum acceptable performance de gradation resulting from en- er gy

conservation.

In the next four subsections, we introduce our techniques. The unifying principle behind the techniques is that they represent different approaches to solving the well-known Multi-Choice Knapsack problem (MCKP) [25], as we explain below. The Knapsack technique is the optimal static approach, whereas the others are dynamic heuristics that differ in how they form and traverse a list of recently accessed memory devices. (Note that an optimal dynamic technique would consider, at each decision point, the sequence of future memory accesses. Because this sequence is not available in practice, we do not consider such a technique.) The fifth subsection describes extensions to the techniques that enable energy conservation without excessive performance degradation. The last subsection discusses the complexity and overheads our techniques impose on the memory controller.

2.1 Knapsack

This technique is based on the observation that the goal of limiting power consumption to a pre-established budget is equivalent to the MCKP. The budget represents the knapsack capacity, whereas each memory device and potential power state represent an object. Objects are grouped by memory device, so that each group contains objects representing all the potential power states of the device. The weight of each object is its power consumption, whereas the object cost is the performance overhead of transitioning from its power state to active state. The goal is to pick one object from each set (i.e., a power state for each memory device), so that the potential performance degradation (i.e., the average device activation delay) is minimized under the constraint that the power budget is not exceeded.

Figure 1: Illustration of Knapsack (the LRU queue of active chips and the non-active chips, (a) before and (b) after the access; head = least recent, tail = most recent).

Typically, the optimal solution is the one in which the most memory devices can be in active state. Based on this formulation, our Knapsack technique computes the optimal configuration of power states for a given power budget. Specifically, the configuration determines the number of devices that should be in each state. For example, assuming the RDRAM information in Table 1 and 4 memory chips, the optimal configuration would be 1 chip in active state and
3 chips in nap state for a power budget of 1399 mWatts. (1399 mWatts is the lowest budget plus 25% of the range between the lowest possible budget, 1176 mWatts, and the highest possible budget, 2067 mWatts. However, out of these 1399 mWatts, the difference between the accessing and active power consumptions, 867 mWatts, is reserved to allow a chip to be activated. Thus, the actual 25% power budget for 4 RDRAM chips is 532 = 1399 - 867 mWatts. Henceforth, the absolute values we list for the power budgets already exclude the 867 mWatts.) For a budget of 755 mWatts (50% of the same range plus 1176 mWatts, minus 867 mWatts), the best configuration would be 2 chips in active state and 2 chips in nap state.

The computation of the optimal configuration is performed offline, so that the memory controller can be initialized with the configuration information. Although the MCKP is NP-hard, the number of memory devices is typically small enough that even a brute-force solution is feasible. For example, our executions for 16 memory devices take only a few minutes to explore all the possible configurations. However, when the number of devices is moderate to large, a heuristic algorithm (and likely more search time) is required. For now, we use brute force. Regardless of how the optimal configuration is computed, the initialization of the memory controller assigns the states described in the configuration to each device randomly.

To guarantee that the power budget is not exceeded at run time, Knapsack manages power states dynamically as follows. When the memory device to be accessed, the "target" device, is already in active state, no action is taken; it can be accessed without transitioning any other devices. When it is in a low-power state, an active device is selected to transition to the current power state of the target device. After this transition occurs, the target device can be activated and accessed. This approach maintains the invariant that the number of devices in each state is always as determined by solving the MCKP offline. To account for the locality of accesses across the different devices, Knapsack selects the active device to be transitioned to a low-power state using an LRU queue. Specifically, the LRU active device is selected as the victim.

Figure 1 shows a detailed example of the Knapsack technique. We assume 4 RDRAM memory chips and a power budget of 755 mWatts. The figure shows one box per chip, listing the chip number on the left and its current power state on the right (A = active, S = standby, N = nap, and P = powerdown). Figure 1(a) illustrates the configuration when chip 2, currently in nap state, needs to be accessed: chips 0, 3, and 1 are in nap, active, and active state, respectively. Chips 3 and 1 are on the LRU queue of active chips. This is one of the optimal configurations for this number of chips and power budget. Because the total power consumption is 660 mWatts at this point, simply activating chip 2 would violate the budget. To remain at an optimal configuration after chip 2 is activated, Knapsack changes the state of the LRU active chip (chip 3) to that of chip 2, and then allows the access to proceed, leading to the configuration in Figure 1(b).

Figure 2: Illustration of LRU-Greedy (the LRU queue in states (a) through (d); head = least recent, tail = most recent).

The main problem with Knapsack is that it is only feasible when the number of devices is small enough that a heuristic algorithm can produce a close-to-optimal solution within a reasonable amount of time. Furthermore, every time a change in power budget is desired, Knapsack involves a recomputation of the configuration. Because recomputing the configuration may be time-consuming when the number of devices is relatively large, we next describe three techniques that do not rely on finding an optimal or close-to-optimal configuration: LRU-Greedy (Section 2.2), LRU-Smooth (Section 2.3), and LRU-Ordered (Section 2.4). As we mentioned above, these techniques leverage an LRU queue of memory devices. The main difference between them is the way each one traverses the LRU queue and which devices are included in it.

2.2 LRU-Greedy

LRU-Greedy tries to keep as many devices as possible in active state. It involves a single data structure kept by the memory controller: the LRU queue of memory devices. All changes to the queue are performed dynamically as accesses arrive at the controller. Specifically, when a device is about to be accessed, it is removed from the LRU queue. At this point, we have two possible scenarios: (1) if the target device is active, the controller moves it
to the end of the queue and proceeds with the access; and (2) if the target device is in a low-power state, the controller calculates whether activating it would violate the budget. If not, the controller moves it to the end of the queue, activates it, and allows the access to proceed. If so, one or more other devices will have to change states before the access can proceed.

The distinguishing feature of LRU-Greedy is how it traverses the LRU queue to decide on these state changes. Specifically, LRU-Greedy starts with the LRU memory device, sending it to the shallowest power state that would satisfy the budget. If changing the state of the LRU device alone is not enough, it is left in powerdown state and the process is repeated for the next device on the queue, and so forth, until the budget is finally satisfied.
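The traversal rule above can be sketched as follows. This is our reconstruction of the rule as stated in the text, not the authors' controller code; power values follow Table 1:

```python
# Sketch of LRU-Greedy: starting from the LRU device, send it to the
# shallowest state that satisfies the budget; if even powerdown is not
# enough, leave it there and continue with the next device in the queue.

POWER_MW = {"active": 300, "standby": 180, "nap": 30, "powerdown": 3}
DEEPER = {"active": ["standby", "nap", "powerdown"],
          "standby": ["nap", "powerdown"],
          "nap": ["powerdown"],
          "powerdown": []}

def lru_greedy_access(queue, states, target, budget_mw):
    """queue: device ids, LRU first. states: id -> state. Mutates both."""
    queue.remove(target)
    # Power the remaining queue may use once the target is activated.
    limit = budget_mw - POWER_MW["active"]
    for dev in queue:
        if sum(POWER_MW[states[d]] for d in queue) <= limit:
            break
        for s in DEEPER[states[dev]]:  # shallowest state first
            states[dev] = s
            if sum(POWER_MW[states[d]] for d in queue) <= limit:
                break
    states[target] = "active"
    queue.append(target)  # MRU position

# The scenario of Figure 2: queue 2, 0, 3, 1 (LRU first).
queue = [2, 0, 3, 1]
states = {2: "nap", 0: "nap", 3: "active", 1: "active"}
lru_greedy_access(queue, states, target=2, budget_mw=755)
print(states)  # chip 0 -> powerdown, chip 3 -> nap, chip 2 -> active
print(queue)   # [0, 3, 1, 2]
```

Run on the example below, this sketch reproduces the text's outcome: chip 0 ends in powerdown, chip 3 in nap, and the queue consumes 333 mWatts before chip 2 is activated.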
Figure 2 illustrates the operation of LRU-Greedy in the same scenario as Figure 1. Figure 2(a) illustrates the status of the queue when chip 2, currently the LRU chip, needs to be accessed: chips 2, 0, 3, and 1 are in nap, nap, active, and active state, respectively. Again, because the total power consumption is 660 mWatts at this point, simply activating chip 2 would violate the budget. Thus, Figure 2(b) shows the queue after chip 2 is removed from it. Since chip 2 will consume 300 mWatts when activated, the chips still on the queue can consume at most 755 - 300 = 455 mWatts. LRU-Greedy then traverses the queue to reduce the consumption below this value. Figure 2(c) shows the queue after chip 0 is sent to powerdown state and chip 3 is sent to nap state. These changes bring the consumption of the queue to 333 mWatts. Finally, Figure 2(d) shows the queue after chip 2 is activated and moved to the end of the queue, leading to a consumption of 633 mWatts. The memory access can proceed at that point.

Figure 3: Illustration of LRU-Smooth (the LRU queue in states (a) through (d); head = least recent, tail = most recent).

2.3 LRU-Smooth

The LRU-Smooth technique tries to keep more devices in shallow low-power states, rather than fewer devices in deeper power states as in LRU-Greedy. To accomplish this, LRU-Smooth traverses the LRU queue differently than LRU-Greedy when the target device is in a low-power state and activating it
would violate the power budget. Specifically, LRU-Smooth goes through the LRU queue (from the LRU device to the MRU device), sending each device to the next lower power state (and eventually returning to the front of the queue, if necessary) until the set of devices in the queue consumes less power than the budget minus the power consumption of one active device.

Figure 3 illustrates how LRU-Smooth works in the same scenario as Figures 1 and 2. Figures 3(a) and 3(b) are the same as in LRU-Greedy. However, as Figure 3(c) depicts, LRU-Smooth switches chips 0, 3, and 1 to powerdown, standby, and standby, respectively. These changes bring the consumption of the chips on the LRU queue to 363 mWatts. At that point, chip 2 can be inserted back, activated, and accessed, for a final consumption of 663 mWatts.
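The round-robin demotion sweep can be sketched as follows (again our reconstruction from the text, with Table 1's power values; the guard against an unsatisfiable budget is omitted for brevity):

```python
# Sketch of LRU-Smooth: sweep the LRU queue from LRU to MRU, demoting
# each device by one state, wrapping around to the front until the
# queue fits in the budget minus one active device.

POWER_MW = {"active": 300, "standby": 180, "nap": 30, "powerdown": 3}
NEXT_LOWER = {"active": "standby", "standby": "nap",
              "nap": "powerdown", "powerdown": "powerdown"}

def lru_smooth_access(queue, states, target, budget_mw):
    """queue: device ids, LRU first. states: id -> state. Mutates both."""
    queue.remove(target)
    limit = budget_mw - POWER_MW["active"]
    i = 0
    while sum(POWER_MW[states[d]] for d in queue) > limit:
        states[queue[i]] = NEXT_LOWER[states[queue[i]]]
        i = (i + 1) % len(queue)  # eventually wrap to the front
    states[target] = "active"
    queue.append(target)

# The scenario of Figure 3: same starting point as Figure 2.
queue = [2, 0, 3, 1]
states = {2: "nap", 0: "nap", 3: "active", 1: "active"}
lru_smooth_access(queue, states, target=2, budget_mw=755)
print(states)  # chip 0 -> powerdown, chips 3 and 1 -> standby
```

One sweep over the example queue demotes chip 0 to powerdown and both active chips to standby, matching the 363 mWatt figure in the text.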

2.4 LRU-Ordered

LRU-Ordered addresses the problems of LRU-Greedy and LRU-Smooth at the same time. The idea is to assign low-power states evenly (as in LRU-Smooth) but avoid sending active devices to low-power mode if possible (as in LRU-Greedy and Knapsack). This is accomplished by creating an additional data structure: a priority queue (implemented as a heap) for the memory devices that are in low-power states. The queue is ordered by how shallow the power mode is; devices in shallower states are selected to go to deeper states first. For this reason, we refer to it as the "ordered" queue. The LRU queue is then reserved for active devices only.

Figure 4: Illustration of LRU-Ordered (the ordered queue of non-active chips and the LRU queue of active chips, (a) before and (b) after the access).

In more detail, LRU-Ordered operates in a similar manner to LRU-Greedy and LRU-Smooth. The differences are (1) the handling of the two queues; and (2) the actions that are taken when the target device is in a low-power state and activating it would violate the power budget. The handling of the queues is done in the obvious manner. When (2) occurs, the controller first moves the LRU active device to the front of the ordered queue. Then, it repeatedly sends the device at the top of the heap to the next lower power state until the overall power consumption is lower than the budget minus the power consumption of one active device.

Figure 4 depicts how LRU-Ordered works in the same scenario as Figures 1, 2, and 3. Figure 4(a) shows the ordered (top) and LRU (bottom) queues when the access for chip 2 arrives, whereas Figure 4(b) shows the configuration after chip 3 is downgraded to nap state and the access is allowed to proceed. The final configuration consumes 660 mWatts and is actually optimal for this number of chips and power budget.
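The two-queue scheme can be sketched with a binary heap. This is our reconstruction under stated assumptions (Table 1's values; heap entries as (state depth, chip id) tuples, which is our encoding, not the paper's):

```python
import heapq

# Sketch of LRU-Ordered: active devices live on an LRU queue; non-active
# devices live in a heap ordered by state depth, so the shallowest device
# is demoted first. On a budget violation, the LRU active device is
# pushed into the heap and heap tops are demoted one step at a time.

POWER_MW = {"active": 300, "standby": 180, "nap": 30, "powerdown": 3}
DEPTH = {"active": 0, "standby": 1, "nap": 2, "powerdown": 3}
NEXT_LOWER = {"active": "standby", "standby": "nap", "nap": "powerdown"}

def lru_ordered_access(lru, ordered, states, target, budget_mw):
    """lru: active ids, LRU first. ordered: heap of (depth, id)."""
    ordered.remove((DEPTH[states[target]], target))
    heapq.heapify(ordered)
    limit = budget_mw - POWER_MW["active"]

    def total():
        return sum(POWER_MW[states[d]] for d in lru) + \
               sum(POWER_MW[states[d]] for _, d in ordered)

    if total() + POWER_MW["active"] > budget_mw:
        # Move the LRU active device to the ordered queue...
        victim = lru.pop(0)
        heapq.heappush(ordered, (DEPTH[states[victim]], victim))
        # ...then demote the shallowest device until the budget fits.
        while total() > limit:
            _, dev = heapq.heappop(ordered)
            states[dev] = NEXT_LOWER[states[dev]]
            heapq.heappush(ordered, (DEPTH[states[dev]], dev))
    states[target] = "active"
    lru.append(target)

# The scenario of Figure 4: chips 3 and 1 active (3 is LRU), 0 and 2 in nap.
lru = [3, 1]
ordered = [(2, 0), (2, 2)]  # (state depth, chip id)
states = {0: "nap", 2: "nap", 3: "active", 1: "active"}
lru_ordered_access(lru, ordered, states, target=2, budget_mw=755)
print(states)  # chip 3 ends in nap: 2 active + 2 nap = 660 mW
```

On the example, chip 3 is moved into the heap, demoted twice (standby, then nap), and chip 2 is activated, reproducing the optimal 660 mWatt configuration described above.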

2.5 Performance-Aware Energy Conservation

Memory energy conservation is an important goal in many environments. For example, because batteries should last as long as possible in battery-operated devices, conserving memory energy in these devices is beneficial. More interestingly, in the two IBM p670 servers measured in [21], memory power represents 19% and 41% of the total power, whereas the processors account for only 24% and 28%; conserving memory energy is important for these servers. With these different environments in mind, we should conserve as much energy as possible (beyond the conservation that comes naturally from the lower power budget), at the same time as limiting power consumption. Thus, we developed versions of our techniques that conserve additional energy as long as doing so does not degrade performance beyond a pre-established threshold.

2.5.1 Memory Energy Conservation

Our approach for conserving additional energy is simple: the memory controller is responsible for sending a device to a lower power state when the device has been idle at the current state for the state's "break-even" time. The break-even time is defined as the time it takes for the energy consumption in the current state to equal the energy it would take to go down to the next lower power state and then immediately go up to active state. Assuming the
RDRAM states and transition costs from Table 1, the break-even times for the transitions from active to standby, standby to nap, and nap to powerdown are 14 ns, 69 ns, and 3333 ns, respectively. Our energy conservation approach uses the break-even time as the transition threshold time. This same power-management approach has been used in a number of previous papers on memory energy conservation, e.g. [20, 29].
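The break-even definition above can be worked through numerically. We read it as solving (P_current - P_lower) * t = E_down + E_up, with each transition energy approximated as transition power times transition delay; this is our interpretation, sketched with Table 1's numbers:

```python
# Sketch of the break-even computation: the idle time t after which
# demoting a device and later re-activating it costs less energy than
# staying put. Transition energies are in mW*ns; this formula is our
# reading of the definition in the text, not the authors' code.

def break_even_ns(p_current_mw, p_lower_mw, e_down, e_up):
    """Break-even idle time in nanoseconds.

    e_down, e_up: transition energies (transition power * delay, mW*ns)
    for going down one state and coming back up to active.
    """
    return (e_down + e_up) / (p_current_mw - p_lower_mw)

# Nap -> powerdown: re-activation costs 15 mW over 6000 ns; the
# down-transition energy is comparatively negligible here.
t = break_even_ns(p_current_mw=30, p_lower_mw=3,
                  e_down=0,
                  e_up=15 * 6000)  # powerdown -> active
print(round(t))  # ~3333 ns, matching the nap-to-powerdown threshold
```

With these inputs the formula reproduces the 3333 ns nap-to-powerdown threshold quoted above; the 14 ns and 69 ns thresholds additionally include the small down-transition energies.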
Given our approach to energy conservation, we modified our techniques as follows:

Knapsack. We modified this technique to compute (still offline) the optimal configurations for all possible numbers of active devices, up to the total number of devices in the system, and store them in a table at the memory controller. However, instead of trying to minimize the average transition delay, we modified the technique to minimize the average delay x power product, giving more importance to low power than low transition delay. (This metric should not be confused with execution-time x energy, i.e. the energy-delay product, which is often used to compare energy conservation techniques when performance is a concern.) With the optimal configuration information, the memory controller dynamically changes configurations when the number of active devices is about to change, i.e. (1) when an active device's transition threshold elapses and the device is about to be sent to the first low-power state; or (2) when the target device is in a low-power state and activating it would not violate the power budget. In those situations, the controller looks up the best configuration for the new number of active devices and adjusts states accordingly. The adjustments are made so that the smallest possible number of state changes is performed. When activating a device would violate the budget, the basic strategy of exchanging power states is used.

LRU-Greedy, LRU-Smooth, and LRU-Ordered. The small modification we made to these techniques is the same. Whenever the threshold time associated with the current state of a device expires, the new state of the device (the next lower power state) is recorded. In the case of LRU-Ordered, the new state may also cause a change in the ordered queue. The energy-conserving versions of our techniques have the label "EC" appended to their names.

2.5.2 Performance Guarantee

Guaranteeing a certain performance
le el in the conte xt of less- than-peak po wer udget may not be possible. The reason is that the actual performance de gradation xperienced by an application depends on its memory access properties, on the po wer udget, and on the technique used to limit the po wer consumption. If the access properties are not ideal for the technique or the po wer udget is lo significant performance de gradation may ensue. Ho we er it is certainly possible to limit the performance de gra- dation esulting fr om attempting to conserve additional ener gy to an acceptable percent threshold. The reason is

that we ha the option to stop trying to conserv additional ener gy when the de gra- dation starts to xceed the threshold. (In contrast, we cannot stop respecting the po wer udget.) created modified ersions of our ener gy-conserving tech- niques for them to pro vide soft performance guarantee. These ersions ha the label “EC-Perf appended to their names. In detail, the performance guarantee is based on the notions of slac and epoc The slack is the total sum of delays that the mem- ory accesses are allo wed to xperience without violating the perfor mance guarantee. or xample, if the

performance guarantee is and the erage memory access time without ener gy conserv ation is the erage memory access time with ener gy conserv ation should be no orse than (1 Note that, in our approach, delays in memory access time are assumed to translate directly into delays in end-performance. Although this is pessimistic assump- tion for modern processors, it does mak sure that our techniques do not violate the performance guarantee. An epoch defines fix ed-length interv al (5M processor ycles in our xperiments) of the application ecution. At the start of each epoch, the ailable

slack for the epoch is computed as the epoch air share of the allo wed slo wdo wn plus an lefto er slack from pre vious epochs. Because the state of the lists and memory de vices in the EC-Perf ersions of our techniques can de viate significantly from their corresponding base ersions, correctly maintaining the ailable slack during an epoch is major challenge. solv this problem, the EC-Perf ersions compare the (list-processing and state-transition) erhead the incur on each memory access with the erhead that ould ha been incurred by their corresponding base ersions. determine the erhead of

the base version, the memory controller "simulates" the lists and device states under the base version without actually performing any memory accesses or state transitions. The simulation is performed off the critical path of the access, i.e., while the corresponding memory block is being transferred on the bus. The slack is decreased by the difference between the overhead of the EC-Perf version and that of the base version. If the former overhead is higher, the slack decreases; otherwise, it increases. If the available slack ever becomes non-positive during an epoch, our techniques turn off energy conservation until the end of the epoch and send chips to their corresponding states in the simulation. When an epoch ends, we adjust the transition thresholds listed in the previous subsection using the same approach as Li et al. [22]. Intuitively, if some slack is left unused at the end of the epoch, the thresholds are reduced. If not, the thresholds are increased. Our approach for providing performance guarantees is inspired by [22], but with three differences. First, turning energy conservation off in our context does not mean keeping all devices in the active state. Because we still need to respect
the power budget, our techniques revert back to a configuration that does so. Second, again due to the less-than-peak budget, we need to identify the delays that are really caused by trying to conserve additional energy. Finally, our handling of epochs is different in that they correspond to fixed time periods (independent of the instructions executed by the processor) in our approach.

2.6 Complexity and Overheads

Our techniques are simple to implement. Many memory controllers, e.g. [40], include low-power processors and highly integrated buffering resources. We assume that the
controller resides in the processor chip, as in the Niagara processors from Sun and the Opteron processors from AMD. In terms of buffering, our techniques require enough space to store their one or two queues, each of which can have at most as many entries as the total number of memory devices. Knapsack and its variants also require space for storing the best state configurations, each of which has exactly as many entries as the total number of memory devices. For energy conservation, our techniques require a counter per device to account for the transition threshold. To provide performance guarantees, our EC-Perf techniques require
a counter for the available slack, a counter for the epochs, a counter for the list-processing and state-transition overheads, buffer space for simulating their respective base versions, and a counter for the overhead of these versions. In terms of processing overheads, our techniques need to update their LRU queues whenever a memory access arrives for a device that is different than the last device accessed. LRU-Ordered and its variations also need to update the ordered queue, but only when a device changes power state. The overhead of these updates is a few pointer manipulations. In addition, an access to a device in a low-power state that can be activated without violating the budget involves a few arithmetic/logic operations to verify that this is indeed the case. The techniques need to traverse their queue(s), but only when an access is directed to a device in a low-power state and activating the device would violate the power budget. To provide performance guarantees, the available slack needs to be dynamically updated. Because the simulation of the base version must be performed off the critical path of memory accesses, the controller must be capable of simulating base accesses while the
cache lines are transferred on the bus. This can be done without impacting running time, since line transfers take much longer (56 processor cycles in our RDRAM experiments) than the simulation overhead. In our evaluation, we simulate all of these processing overheads in detail. In fact, we carefully assessed them by first implementing the required operations explicitly and then counting the number of x86 assembly instructions to which they translate. From this study, we found that updating a queue entry, removing a queue entry (after it has been found), and inserting a queue entry should each take only a small number of instructions. We assume that the controller can process one x86 instruction per cycle. Given these values, we find the list-processing overheads to be a small fraction of the latency of an actual (cache-line-sized) memory access. For example, for 8 chips and a 50% budget, the average number of list-processing overhead cycles per memory access is 9.0, 10.9, 11.2, and 12.9 for Knapsack, LRU-Greedy, LRU-Smooth, and LRU-Ordered, respectively. When performance-aware energy conservation is being used, the average number of overhead cycles per memory access in LRU-Ordered is 15.0, again with 8 chips and a 50% budget. Thus, in the worst case for this configuration, the list-processing overheads represent a 16% increase in the average memory access latency of 92.1 processor cycles (in the absence of power limitation, memory controller overheads, or energy conservation) in our experiments. In the most challenging configuration we study, with 16 chips and a 25% budget, the list-processing overheads increase to 21%.

3. EVALUATION

3.1 Methodology

Our evaluation is based on detailed full-system simulations using Simics [38] version 2.2.19, and our simulation code
for the memory subsystem and the power-limiting techniques. We simulate an x86 in-order single-core processor running at 2.8 GHz. The cache hierarchy is composed of two levels. The first level (L1) is a split, 2-way set associative, 64-KB cache with 32-byte lines and a 2-cycle access time. The second level (L2) is a 4-way set associative, 256-KB cache with 64-byte lines and an 8-cycle access time. The Simics processor model we use only allows the L2 cache to have one outstanding miss at a time. Simulating one outstanding miss at a time exposes any memory access overheads associated with limiting
power. Nevertheless, we do study the sensitivity of our results to the number of concurrent outstanding misses in Section 3.3. Because full-system simulations take extremely long to complete, each Simics run generates a trace of the memory accesses performed by an application. The trace is later used to drive detailed simulations of the memory subsystem under our different techniques. We simulate memories with 512 MBytes. Throughout most of the evaluation, we simulate RDRAM chips running at 1.6 GHz (Table 1). Each memory chip is capable of transferring 2 bytes per memory cycle, providing a peak
transfer rate of 3.2 GB/s. Recall that we simulate RDRAM by default, but also discuss results for DDR2 SDRAM in Section 3.3. Based on the RDRAM manuals, we define that filling an L2 cache miss on a load from an active chip takes 130 processor cycles, when both row and column accesses are necessary. An L2 cache writeback to an active chip takes 88 processor cycles, again when both row and column accesses are necessary. These times include the 56 processor cycles required to transfer the L2 cache line on the memory bus. We simulate one open 2-KByte page per chip. In contrast with some RDRAM memory
controllers, our techniques do not transition a chip to the standby state immediately after an access, since they are primarily intended to limit power consumption (rather than conserve energy). We use a set of eight applications from three types of systems: four from the MediaBench benchmarks [26], representing the workloads of hand-held devices; three from the SPEC CPU2000 benchmarks [6], representing the workloads of desktop systems; and a client-server benchmark, representing the workloads of server systems. We carefully selected the applications. The MediaBench applications we study are epic,
gsmdecode, gsmencode, and mpeg2encode. They are the longest-running applications in MediaBench. The SPEC CPU2000 applications are bzip2, gzip, and vortex with the train input set (running with ref input sets would have been impractical in terms of simulation time). The reason we chose these CPU2000 applications is that their behavior is similar under ref and train input sets [39]. In Section 3.3, we also study mcf, the most memory-bound application in the CPU2000 suite, to assess the behavior of our techniques in an extremely unfavorable scenario. Finally, the client-server application (called CS
hereafter) comprises an Apache server and a client, each being simulated by an instance of Simics. Simics also simulates the network. The HTTP workload is the Clarknet publicly available Web-server trace. We run 30,600 requests of it to limit the simulation time and show results for the server. Our experiments simulate the applications and the operating system (Linux 2.6.8). Thus, the allocation of virtual pages to physical frames is done by Linux itself. We assign consecutive physical frames to the same chip until the chip runs out of space, moving on to the next chip after that. We study the effect of
several parameters on the behavior of our techniques: the power limit, the number of memory chips, the maximum number of outstanding cache misses, the memory technology, whether memory energy conservation is enabled, and whether performance guarantees are provided when conserving energy. The power limit is defined as a fraction of the range of possible power budgets. Specifically, the power limit is:

power_limit = percent_budget x maxpb + (1 - percent_budget) x minpb

where maxpb is the maximum power budget (i.e., the peak power consumption), minpb is the minimum power
budget (which is not the consumption of the powerdown state times the number of chips, because it must be possible to access at least one chip), percent_budget is a value from 0 to 1, paccess and pactive are the power consumptions in the accessing and active states, respectively, and access_size is the number of chips involved in each memory access. We study 25%, 50%, and 75% as values for the budget. In terms of the number of memory chips, we study 4, 8, and 16. These numbers cover the range from hand-held devices to small-scale servers. Although larger servers may use more than 16 chips, these systems
also typically have multiple memory controllers, which limits the number of chips assigned to each controller [21]. In such a system, the budget can be partitioned between the controllers, each of which can enforce its fraction as in this paper. We scaled down the sizes of the caches that we simulate because the applications are run with relatively small inputs. With our settings, the CPU is stalled waiting for the memory subsystem in epic, gsmdecode, gsmencode, mpeg2encode, bzip2, gzip, vortex, and CS for 9.9%, 2.2%, 13.6%, 2.2%, 20.6%, 2.6%, 3.5%, and 10.9% of the execution time (without limiting power or conserving energy), respectively. The same quantity for mcf is 55.5%. Our simulations fix the memory size at 512 MBytes to avoid introducing an extra parameter that is not as important as the others we study. Although this memory size may seem excessive for some of these applications, recall that we simulate the operating system (and its mapping of virtual pages to physical memory frames) as well.
Figure 5: Power limit, and maximum and average power consumption (CS, 8 chips, 50% budget, LRU-O, PL-EC-Perf).

In fact, for these applications and numbers of chips, the accesses are spread across all chips. However, there are always a few chips that receive a larger fraction of the accesses. For example, CS on 8 chips exhibits two chips that each receive around 29% of the accesses, whereas the other chips receive between 3% and 19% of the accesses. Bzip2 on 8 chips exhibits similar behavior, where two chips receive 29% and 23% of the accesses, whereas the others receive between 1% and 18% of the accesses. Decreasing the memory size (while keeping the same number of chips) would have the effect of more evenly
distributing the accesses across the chips. This effect is akin to increasing the number of chips (while keeping the same memory size), as we do in this paper. Our graphs refer to the techniques as Knap (Knapsack), LRU-G (LRU-Greedy), LRU-S (LRU-Smooth), and LRU-O (LRU-Ordered). We refer to the (base) versions that only limit power consumption as PL (for Power-Limited), the versions that also conserve memory energy as PL-EC (for Energy-Conserving), and the versions that limit power, conserve memory energy, and limit the performance degradation resulting from energy conservation as
PL-EC-Perf (for Performance). The degradation threshold was always kept at 3% in the PL-EC-Perf simulations. Because the parameter space comprises seven dimensions (technique, variation, application, power limit, number of chips, maximum number of outstanding misses, and memory technology), we present figures for the interesting (2D) parts of the space. Finally, the two metrics we consider are overall application performance degradation and memory energy savings. We do not model processor or system energy explicitly in our simulations. However, as the results in the next subsection demonstrate,
our best power-limiting techniques degrade performance only slightly (2% or less in the vast majority of cases), i.e., the memory energy savings we report are a close estimate of the overall energy savings that are achievable. In fact, we can easily estimate the overall energy savings by assuming a particular breakdown between the energy consumed by the memory and the rest of the system (without power limitation or energy conservation). If the memory represents m% of the total system energy, the memory energy savings is s%, and the degradation is assumed to be 0%, the overall system-wide energy savings is s x m / 100 %.

3.2 Base Results

Before getting into an extensive analysis of results, it is important to emphasize that our techniques are successful at limiting power consumption to the pre-established budget at all times and for all applications. As an example of the power consumption behavior of applications under a power limit, Figure 5 plots the power consumption of part of the execution of CS under LRU-Ordered with performance-aware energy conservation (PL-EC-Perf), assuming 8 memory chips and a power limit of 1361 mWatts (the 50% budget for 8 chips). The figure plots the power limit, the
maximum power consumption of each interval of 1M processor cycles, and the average power consumption during those intervals, as a function of time. During the second half of the slice illustrated in the figure, the performance slack is exhausted, causing the maximum and average power consumptions to concentrate close to the limit. Note that, in this region, there are still power state changes, since the limit does not allow all memory chips to be active at the same time. We can also clearly see that the limit is never violated, despite the fact that power-state configurations
changed a large number of times. Further, we observe that the average power consumption is close to the maximum consumption most of the time, suggesting that performance degradations should be small. The result for CS in Figure 6(left) shows that this is indeed the case.

Performance degradation due to limiting power consumption. Figure 6 shows the performance degradation suffered by each application, as a function of different parameters. In Figures 6(left) and 6(middle), each degradation is computed with respect to an "unrestricted" execution that does not impose any limits on the power
consumption, does not attempt to conserve energy (all chips are active all the time), and involves no memory controller overheads. In Figure 6(right), the degradations are computed with respect to the unrestricted execution with the corresponding number of memory chips. The set of three bars on the right of each graph presents the average across all applications. Figure 6(left) compares the performance degradation of the four techniques we study, assuming 8 chips and a 50% power budget. For all applications, Knapsack and LRU-Ordered degrade performance only slightly: less than 3% in all cases; less
than 1% in all but one case (bzip2). Knapsack behaves so well because it optimizes performance within the available power budget. On the other hand, LRU-Ordered does well by achieving a similar effect; it attempts to avoid going down to deep low-power states, even if an active chip needs to be sent to a (shallow) low-power state. In contrast, LRU-Greedy and LRU-Smooth degrade performance more substantially; they can exhibit degradations as high as 76% and 67%, respectively. Surprisingly, these two techniques do poorly for the same applications (epic, bzip2, and vortex), despite the fact that they
traverse their LRU queues very differently. At closer inspection, one can easily understand these results: neither technique is capable of effectively limiting average transition delays. Specifically, LRU-Greedy sends a few chips to deep power states, whereas LRU-Smooth sends more chips to shallow low-power states, including multiple active chips. Although neither technique does well, LRU-Smooth performs a little better than LRU-Greedy. Figure 6(middle) compares performance degradations as a function of the power budget for LRU-Ordered, again assuming 8 chips. As we would expect,
increasing the budget decreases the degradations, since chips can stay in shallower power states. More interestingly, even with a small budget of 25%, LRU-Ordered degrades performance by less than 2.5%, except in the case of bzip2 (13% degradation). Knapsack achieves similar results, whereas LRU-Greedy and LRU-Smooth exhibit significant degradations with the 25% budget (not shown). Figure 6(right) compares performance degradations as a function of the number of memory chips for LRU-Ordered, assuming a 50% budget. Interestingly, note that, in all but one case, degradations decrease as we increase the number of chips from 4 to 8, but increase when we go from 8 to 16 chips. Initially, we expected
Figure 6: Performance degradation as a function of technique (left), budget (middle), number of chips (right), and application.

Figure 7: Energy savings as a function of technique (left), budget (middle), number of chips (right), and application.
degradations to consistently decrease with an increasing number of chips, so this result surprised us. Upon closer inspection, it becomes clear that our intuition missed the fact that a larger number of chips may result in a combination of more active chips but also more chips in the powerdown state. As an example, note that the optimal Knapsack configuration for 8 chips dictates active chips and chips in the nap state, whereas that for 16 chips dictates active chips, chips in the nap state, and (the culprits for the performance degradation) chips in the powerdown state. To address this problem, we could change Knapsack to minimize delay and change LRU-Ordered to include multiple (rather than 1) active chips in the ordered queue when the activation of a chip in a low-power state would violate the budget. However, note that degradations are never higher than 11% even with 16 chips, so we did not pursue these changes. Knapsack exhibits similar trends and absolute degradations as LRU-Ordered (not shown). LRU-Greedy and LRU-Smooth exhibit much higher absolute degradations and the trends that we initially expected, as the increase in the number of chips reduces
their percentage of chips in the powerdown state (not shown).

Energy conservation due to limiting power consumption. Figure 7 shows the memory energy savings, as a function of different parameters. In Figures 7(left) and 7(middle), each amount of savings is computed with respect to an unrestricted execution that does not impose any limits on the power consumption, does not involve memory controller overheads, and does not attempt to conserve energy. In Figure 7(right), the savings are computed with respect to the unrestricted execution with the corresponding number of chips. Note that the
savings depicted in these figures come exclusively from the less-than-peak budget. Figure 7(left) compares the energy savings achieved by the four techniques we study, assuming 8 chips and a 50% power budget. For all applications but bzip2, all techniques conserve substantial energy; at least 34%. Knapsack and LRU-Ordered do well for bzip2, but LRU-Greedy and LRU-Smooth do not. In fact, the latter techniques also conserve noticeably less energy than the former ones for epic and vortex. The reason for these results is that LRU-Greedy and LRU-Smooth increase execution time tremendously
(thereby increasing overall energy consumption) for these applications and simulation parameters, as we illustrated in Figure 6(left). Figure 7(middle) compares energy savings as a function of the power budget for LRU-Ordered, again assuming 8 chips. As we would expect, increasing the budget decreases the savings, since chips can stay in shallower power states. The most interesting result here, though, is that the applications' characteristics have a much weaker impact on the savings than the budgets do. The reason is that what really matters in terms of energy is the ratio between limited and
unlimited power consumption. Knapsack behaves similarly to LRU-Ordered (not shown). LRU-Greedy and LRU-Smooth do so as well, with the only difference that savings actually increase for bzip2 as we increase the budget (not shown). This behavior differs from that of the other applications and is explained by the fact that higher budgets produce substantially smaller performance degradations for bzip2 with these techniques. Figure 7(right) compares energy savings as a function of the number of chips for LRU-Ordered, assuming a 50% budget. As suggested by our comments above, the number of
chips has a relatively small effect on the percentage energy savings due to the power limitation. Again, the exception to this observation is bzip2 with LRU-Greedy and LRU-Smooth, since using more chips also improves performance significantly (not shown).

Performance-aware energy conservation under power limitations. First, note that our performance guarantee algorithm always limits the degradation caused by explicitly trying to conserve additional energy.
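As a concrete illustration of the guarantee mechanism described in Section 2.5, the epoch-based slack accounting can be sketched in a few lines. The class name, the integer cycle accounting, and the scalar per-access overhead model below are our own simplifying assumptions for exposition, not the controller's actual implementation (which also restores chip states from the simulated base version when conservation is disabled).

```python
# Sketch of the epoch-based slack accounting behind PL-EC-Perf.
# Names and the scalar overhead model are illustrative assumptions.

EPOCH_CYCLES = 5_000_000   # fixed-length epoch: 5M processor cycles
GUARANTEE_PCT = 3          # maximum acceptable slowdown: 3%

class SlackAccountant:
    def __init__(self):
        self.slack = 0          # delay cycles we may still add
        self.ec_enabled = True  # energy conservation on/off

    def start_epoch(self):
        # The epoch's fair share of the allowed slowdown, plus any
        # leftover slack carried over from previous epochs.
        self.slack += EPOCH_CYCLES * GUARANTEE_PCT // 100
        self.ec_enabled = True

    def account_access(self, ec_overhead_cycles, base_overhead_cycles):
        # Charge only the delay caused by trying to conserve *additional*
        # energy: the EC-Perf overhead minus the overhead the simulated
        # base (power-limited-only) version would have incurred.
        self.slack -= ec_overhead_cycles - base_overhead_cycles
        if self.slack <= 0:
            # Slack exhausted: turn energy conservation off until the
            # end of the epoch.
            self.ec_enabled = False

acct = SlackAccountant()
acct.start_epoch()            # slack becomes 150,000 cycles
acct.account_access(15, 11)   # EC-Perf cost 4 cycles more than base
print(acct.slack, acct.ec_enabled)
```

Note that the slack can also grow during an access (when the base version would have been slower), which is why the difference, not the raw overhead, is charged.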
Figure 8: Performance (left) and energy savings (right) as a function of the variation of LRU-Ordered and application.

Figure 9: Comparing the performance degradation (left) and energy savings (right) of PL, PL-EC-Perf, and PD as a function of application. The graphs assume 8 chips, a 25% power budget, and a 3% maximum degradation due to energy conservation.

To assess the performance degradations with respect to
unrestricted executions that impose no limits on power consumption, do not involve memory controller overheads, and do not attempt to conserve energy, consider Figure 8(left). It compares performance degradations as a function of the variations of the LRU-Ordered technique, assuming 8 chips, a 50% power budget, and a maximum acceptable performance degradation of 3%. The base version of LRU-Ordered is referred to as PL in the figure. These results show that PL causes the smallest degradation in performance, less than 1% on average. In contrast, the energy-conserving variation (PL-EC) causes
a tremendous performance degradation; more than 1500% in one case (bzip2). Similar dramatic degradations have been observed before when memory energy conservation is not performance-aware [22]. When the performance guarantee is imposed on the energy-conserving variation (PL-EC-Perf), degradations are reduced to the range 3-6% (recall that these results include the degradation coming from the power limitation and from trying to conserve additional energy). Knapsack exhibits similar behavior. In contrast, LRU-Greedy and LRU-Smooth exhibit high degradations for bzip2 and vortex, even under
our performance guarantee (not shown). Figure 8(right) shows the energy savings that can be achieved by the variations of LRU-Ordered, assuming 8 chips, a 50% power budget, and a maximum acceptable degradation of 3%. The savings are computed with respect to the unrestricted executions. These data show that the PL-EC variation of our techniques can typically increase energy savings significantly, but only at the cost of the increased execution time we just saw. In the case of bzip2, the tremendous increase in execution time actually induces substantially lower savings. This effect is less
pronounced for the other applications, as they exhibit lower degradations. The other techniques exhibit similar behaviors (not shown). When our performance guarantee is in place (PL-EC-Perf), the additional energy savings achieved by our techniques (only results for LRU-Ordered are shown) are significant compared to those of PL for all applications, except bzip2. As we discuss below, decreasing the budget also decreases the difference in energy savings between PL and PL-EC-Perf, as chips have to stay in the deeper power states a larger fraction of the time just to respect the
budget.

Comparing PL and PL-EC-Perf against state-of-the-art energy conservation. More importantly, the PL and PL-EC-Perf variations of Knapsack and LRU-Ordered can conserve at least as much energy as state-of-the-art techniques that conserve energy explicitly in the absence of a power limit, for equally small performance degradations. In particular, these techniques can match the energy savings of the latter approach by appropriately setting the power budget. To support these claims, Figure 9 compares the PL and PL-EC-Perf variations of LRU-Ordered (with a 25% budget) against the best
performance-aware memory energy conservation technique, the PD technique described in [22], for 8 chips. PD dynamically changes transition thresholds according to predicted access patterns and available slack. Note that there are two PD bars for each application: PD(PL) represents the execution where we allow PD to degrade performance by the same amount as PL with a 25% budget; PD(PL-EC-Perf) represents the execution where we allow PD to degrade performance by the same amount as PL-EC-Perf with a 25% budget. To achieve the best possible results for PD, we chose the epoch length that
exhibits the highest energy savings on average for our set of applications, namely 50M processor cycles. The figure shows that PD(PL) achieves significant energy savings for most of the applications, 55% on average. In terms of performance, as intended, PD(PL) matches the degradation of PL in all cases, except bzip2. For bzip2, PD(PL) is unable to exploit the additional slack to conserve more energy. PL conserves more energy than PD(PL) in almost all cases, achieving 66% savings on average. Similar observations can be made when comparing PD(PL-EC-Perf) and PL-EC-Perf. On average, our technique achieves 83% energy savings, whereas PD(PL-EC-Perf) produces 68% savings. In fact, PD(PL-EC-Perf) conserves only slightly more energy than PL on average, even though we allowed PD(PL-EC-Perf) a 3% degradation in performance beyond the degradation of PL. Comparing Figures 8(right) and 9(right) shows that the difference in energy savings between PL and PL-EC-Perf decreases significantly with lower budgets. The more we reduce the power budget or the maximum acceptable degradation, the more similar PL and PL-EC-Perf become. Despite their positive results, the
behavior of PL and PL-EC-Perf for bzip2 is a concern. For this budget of 25%, their performance degradations are clearly unacceptable. Although they conserve less energy than our techniques, PD(PL) and PD(PL-EC-Perf) exhibit better performance for bzip2 (less than 6% degradations). As we had already seen in Figure 6(middle), our techniques exhibit low performance degradations for bzip2 with power budgets higher than 25%. For higher budgets, PL and PL-EC-Perf can again behave better than PD for this application. These results demonstrate that, with an appropriately set power budget, PL and
PL-EC-Perf are indeed superior to the best energy conservation techniques proposed to date. The intuition here is that there is little point in (eventually) sending chips to very deep states, as in PD, if activating these chips a little later will consume a large fraction of the slack. The more states there are, the greater the potential problem is. In PL-EC-Perf, this effect is not as pronounced because the power limit still has to be respected when the slack runs out. The better approach, as in Knapsack and LRU-Ordered, is to keep chips at their "best", deep-enough states, regardless of how
long they stay there; the energy conservation comes from the power limitation itself.

3.3 Additional Results

Limiting power under greater concurrency in the memory subsystem. So far, our simulations assumed a maximum of 1 outstanding L2 cache miss. However, modern systems often allow multiple concurrent cache misses. To understand the effect that our techniques would have on systems with greater memory-access concurrency, we developed a new version of our simulator. Specifically, we implemented an idealized processor capable of issuing memory requests without ever stopping instruction execution, until a maximum
number of outstanding misses is reached. In other words, the idealized processor is capable of completely overlapping instruction execution with cache misses, until a cache miss occurs that would exceed the maximum number of outstanding misses. This behavior is equivalent to always predicting load values correctly [24]. Thus, our idealized processor exacerbates the amount of concurrency that the memory subsystem would see in reality. Using the new version of the simulator, we studied the behavior of LRU-Ordered with 8 chips and a 50% power budget, for systems with at most 1, 2, and 4 outstanding cache misses. Each degradation and savings was computed with respect to the (idealized) execution with the corresponding maximum number of outstanding misses, no limits on power, and no energy conservation. Our results show that the performance degradations associated with limiting power consumption are even smaller for systems with greater memory-access concurrency. For example, systems with a maximum of 4 outstanding misses exhibit less than 0.2% degradation from LRU-Ordered on average. The main reason for this result is that an increasing fraction of the overheads of limiting power consumption, namely
controller and state-transition overheads, can be overlapped as we increase the maximum number of outstanding misses. Interestingly, the energy savings associated with limiting power are almost unaffected by greater memory-access concurrency. The reasons are that all degradations are very small and, most importantly, the energy savings are dominated by the power budget, as we saw in Figure 7.

Limiting power under high memory boundness. The applications we studied so far place light or moderate demands on the memory subsystem. To assess the behavior of our techniques for
applications that are highly memory-bound, we studied mcf, the most memory-bound application in the CPU2000 suite. As one would expect, limiting the power consumption of a highly memory-bound application to a low budget results in high performance degradations, especially when the number of chips is large. Specifically, assuming 8 memory chips and a maximum of 1 outstanding miss, mcf exhibits degradations of 23%, 8%, and 5% for budgets of 25%, 50%, and 75%, respectively, compared to the unrestricted execution of mcf. Despite the high performance degradation for a budget of 25%, LRU-Ordered can still
achieve energy savings of 54%. For a budget of 50%, LRU-Ordered conserves 36% energy, which is still quite significant. The EC-Perf version of LRU-Ordered is not able to conserve more energy (with an acceptable degradation of 3%) than the base version.

The impact of DDR SDRAM technology. To demonstrate the generality of our techniques, we applied them to DDR2 SDRAM. Specifically, we modeled DDR2 with five power states, namely accessing, active standby, precharge quiet, precharge powerdown, and self-refresh [34]. The power consumptions and transition overheads also came from [34].

We simulated memory subsystems with 2 and 8 ranks, where each rank is a set of chips that are accessed and power-managed together. The 2-rank scenario is particularly challenging, since the memory controller has little room to manage power states, i.e., an access to a different rank requires a state change. Nevertheless, the 2-rank simulations show that the performance degradations resulting from LRU-Ordered are only slightly higher, on average, than under RDRAM. For example, the DDR2 performance degradations are 3.2%, on average, with a budget of 50% and a maximum of 1 outstanding miss. An RDRAM memory subsystem with the same number of chips, a budget of 50%, and a maximum of 1 outstanding miss leads to a degradation of 3%, on average. Despite its slightly higher performance degradations under DDR2, LRU-Ordered still conserves 41% of the energy of the unrestricted execution, on average. The equivalent measure for the comparable RDRAM memory subsystem is 45%. In contrast, the 8-rank simulations show that LRU-Ordered exhibits an average performance degradation of 0.6% and an average energy savings of 43%, again assuming a budget of 50% and a maximum of 1 outstanding miss.

3.4 Summary

Several interesting

observations can be made from these results:

- Knapsack and LRU-Ordered are clearly the best techniques for limiting power consumption. The choice between these two techniques comes down to the characteristics of the environment, such as the frequency with which one expects in-the-field changes to the power budget and the number of memory chips. LRU-Greedy and LRU-Smooth generate high overheads in most of the parameter space.

- With good techniques for limiting power, the resulting performance degradations are very small for applications that impose light or moderate requirements on the memory subsystem, even for low budgets. Highly memory-bound applications require correspondingly high budgets for good performance.

- The very fact that power consumption is limited translates into significant energy savings. Attempting to conserve additional energy without excessive performance degradation may require a high budget.

- When using good techniques, limiting power consumption is at least as effective an energy-conservation approach as doing (performance-aware) energy conservation explicitly.

- Given the energy benefits of limiting power consumption, the power budget dominates all other parameters, including application characteristics, in determining how much energy can be conserved.

- The number of memory chips and the characteristics of applications have a significant impact on the performance of the techniques for limiting power. The impact is not as high on the energy savings.

- The maximum number of outstanding cache misses does not have a significant effect on the performance degradation or energy savings that our techniques produce.

- Our good techniques work well for DDR2 on average, even when the number of memory ranks is small.

4. RELATED WORK

Our work touches three main areas: limiting power consumption, managing temperature, and memory energy conservation. Next, we overview the related works in each of these areas.

Limiting power. Felter et al. [10] were the only ones to consider limiting the power consumption of the memory subsystem. They proposed to continuously re-budget the available power between the processor and the memory subsystem. Their base power-control mechanism was a throttling scheme that limits the number of operations performed by each subsystem (instruction dispatches and memory accesses) during an interval of time; once the threshold number of operations is reached, no additional operations are performed until the next interval. Improperly selecting the interval length may cause power-budget violations and unnecessary performance degradation.

In [1], the authors proposed Energy per Instruction (EPI) throttling for chip multiprocessors (CMPs) as a way to minimize the execution times of multithreaded applications while limiting the power consumed by the CMP. As their base power-control mechanism, the authors used clock throttling of each processor core to manage its duty cycle. During periods of low thread-level parallelism, the CMP can spend more EPI by running the available thread(s) on fewer cores, each of which at a higher duty cycle. Conversely, during periods of high thread-level parallelism, the CMP should spend less EPI by running on more cores, each of which at a lower duty cycle.

In [18], the authors considered scaling the voltage/frequency of each core of a chip multiprocessor independently to enforce a chip-level power budget. Power-mode assignments are re-evaluated periodically by a global power manager, based on the performance and average power consumption observed in the most recent period.

Our work differs from these contributions in that we focus on the memory subsystem, limit the power consumption strictly by dynamically changing memory device states (rather than limiting the number of memory accesses during an interval), and combine the power limitation with explicit energy conservation.

For clusters of computers, [11, 32] used CPU voltage/frequency scaling at each node and a cluster-wide controller to limit the power consumption of the entire system. In both works, the controller was responsible for deciding how much of the power budget to assign to each node. Fan et al. [9] assessed the power consumption of data centers under different workloads and the potential benefits of limiting power consumption. Our work differs from these three studies as we focus on stand-alone computers, from hand-helds to servers. For clusters and data centers, coarser-grained approaches that limit the power consumption across multiple computers may indeed be more appropriate than doing so on a per-server level. Still, per-server techniques may be used in guaranteeing that each server does not exceed its assigned fraction of the budget.

Managing temperature. Researchers have considered dynamic thermal management of processors [2, 3, 4, 14, 17, 23, 27, 30, 33, 35, 36, 37], disks [12, 19], and data centers [13, 5, 28]. Most of these contributions apply throttling, dynamic voltage/frequency scaling, and/or activity migration when temperatures exceed some pre-established threshold. Although these works are related to our approach of limiting power consumption, they all assume that the supply of power is provisioned to withstand the peak power consumption of the different subsystems. However, there are several types of scenarios in which such an assumption cannot be made. For example, the supply of power may be limited (as in hand-held devices), expensive to over-provision (as in data centers), or can be reduced due to unit failures (as in blade systems with multiple power supplies). Our work targets these types of scenarios.

Conserving memory energy. Several previous works have sought to conserve memory energy [7, 8, 15, 16, 20, 22, 29, 41]. These papers mainly address techniques for intelligently setting idleness thresholds, and data-layout techniques for increasing the achievable energy savings. In

particular, some works [15, 16, 20, 29] have considered page layouts that concentrate accesses on a subset of chips. We did not consider these layouts; doing so would have made it easier for our techniques to limit power with little performance degradation. Instead, we focused on the more challenging scenario that uses Linux's own virtual/physical page mapping.

Previous works have cast memory energy conservation as an MCKP [22, 42]. Our Knapsack technique was also formulated as an MCKP. However, our formulation differs substantially from previous works. For example, the PS algorithm in [22] seeks to minimize energy under a performance constraint, whereas Knapsack seeks to minimize performance degradation under a power constraint. Furthermore, PS is epoch-based and keeps all chips at statically defined power states during each epoch. The EC-Perf version of Knapsack also uses epochs, but does not restrict chips to specific power states. The base version of Knapsack does neither.

Overall, our techniques differ from these previous contributions as they primarily seek to limit power consumption; they try to conserve additional memory energy in the context of this hard limit.
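To make the contrast concrete, the sketch below poses the MCKP behind Knapsack: pick exactly one power state per chip (one item per class) so that total power stays within the budget while the estimated performance penalty is minimized. The per-state power/penalty numbers are hypothetical, and the brute-force search merely stands in for a real MCKP solver.

```python
from itertools import product

# Minimal MCKP sketch: one power state per chip, minimize total
# performance penalty subject to a total power constraint.
# (name, relative power, estimated performance penalty) -- hypothetical values.
STATES = [("active", 1.0, 0.0), ("standby", 0.5, 0.05), ("powerdown", 0.1, 0.30)]

def knapsack_states(num_chips, power_budget):
    best = None
    # Exhaustive search over one-state-per-chip assignments; exponential,
    # fine for a handful of chips. A real solver would use dynamic programming.
    for combo in product(STATES, repeat=num_chips):
        power = sum(p for _, p, _ in combo)
        penalty = sum(c for _, _, c in combo)
        if power <= power_budget and (best is None or penalty < best[0]):
            best = (penalty, [name for name, _, _ in combo])
    return best  # (minimum total penalty, chosen state per chip)

# Four chips under a budget of half the peak (4 * 1.0) power.
penalty, states = knapsack_states(4, 2.0)
```

Swapping the objective and the constraint (minimize energy subject to a performance bound) yields the PS-style formulation of [22]; the two problems share the same class structure but optimize in opposite directions.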


More fundamentally, our work demonstrates that limiting power consumption can actually be used as a very effective means of conserving energy.

5. CONCLUSIONS

In this paper, we studied four techniques for limiting the power consumption of the memory subsystem: Knapsack, LRU-Greedy, LRU-Smooth, and LRU-Ordered. We also studied variations of the techniques that attempted to conserve energy explicitly and to limit the resulting performance degradation of doing so. Finally, we studied the impact of different parameters on the behavior of the techniques and variations.

Our simulation-based evaluation led us to a number of interesting observations (Section 3.4). One important observation is that Knapsack and LRU-Ordered are clearly superior to the other techniques. Another important (and fundamental) observation is that, using these superior techniques, limiting power consumption is at least as effective for energy conservation as state-of-the-art techniques explicitly designed for performance-aware energy management. It is important to emphasize: it is not surprising that limiting power consumption also conserves energy; what is surprising is that power consumption can be limited with small enough performance degradation over a wide parameter space to make it better than techniques explicitly designed to conserve energy without excessive performance degradation. Thus, a major advantage of using our power-limiting approach is that we can limit power consumption and conserve substantial energy at the same time.

Limitations and future work. So far, we have not considered multiprogramming workloads or different policies for allocating pages to memory chips. In addition, although we did study the effect of greater concurrency in memory accesses in

the context of the number of outstanding cache misses, we have not explicitly considered chip multiprocessors. Finally, we presented a parameter-space study of the behavior of our techniques, but did not consider how to select the ideal power budget in different scenarios. Addressing these limitations is the focus of our future work.

Acknowledgements

We would like to thank Luiz Barroso, Eugene Gorbatov, Partha Ranganathan, the members of the Vice-Versa seminar, and the anonymous referees for comments that helped improve this paper.

6. REFERENCES

[1] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's Law Through EPI Throttling. In Proceedings of ISCA, June 2005.
[2] F. Bellosa, S. Kellner, M. Waitz, and A. Weissel. Event-Driven Energy Accounting for Dynamic Thermal Management. In Proceedings of COLP, September 2003.
[3] D. Brooks and M. Martonosi. Dynamic Thermal Management for High-Performance Microprocessors. In Proceedings of HPCA, January 2001.
[4] P. Chaparro, G. Magklis, J. Gonzalez, and A. Gonzalez. Distributing the Frontend for Temperature Reduction. In Proceedings of HPCA, February 2005.
[5] J. Choi, Y. Kim, A. Sivasubramaniam, J. Srebric, Q. Wang, and J. Lee. Modeling and Managing Thermal Profiles of Rack-Mounted Servers with ThermoStat. In Proceedings of HPCA, February 2007.
[6] Standard Performance Evaluation Corporation. SPEC CPU2000. http://www.spec.org.
[7] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin. Hardware and Software Techniques for Controlling DRAM Power Modes. IEEE Transactions on Computers, 50(11), 2001.
[8] X. Fan, C. Ellis, and A. Lebeck. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, August 2001.
[9] X. Fan, W.-D. Weber, and L. A. Barroso. Power Provisioning for a Warehouse-Sized Computer. In Proceedings of ISCA, June 2007.
[10] W. Felter, K. Rajamani, T. Keller, and C. Rusu. A Performance-Conserving Approach for Reducing Peak Power Consumption in Server Systems. In Proceedings of ICS, June 2005.
[11] M. Femal and V. Freeh. Boosting Data Center Performance Through Non-Uniform Power Allocation. In Proceedings of ICAC, June 2005.
[12] S. Gurumurthi, A. Sivasubramaniam, and V. Natarajan. Disk Drive Roadmap from the Thermal Perspective: A Case for Dynamic Thermal Management. In Proceedings of ISCA, June 2005.
[13] T. Heath, A. P. Centeno, P. George, Y. Jaluria, and R. Bianchini. Mercury and Freon: Temperature Emulation and Management in Server Systems. In Proceedings of ASPLOS, October 2006.
[14] S. Heo, K. Barr, and K. Asanovic. Reducing Power Density Through Activity Migration. In Proceedings of ISLPED, August 2003.
[15] H. Huang, P. Pillai, and K. G. Shin. Design and Implementation of Power-Aware Virtual Memory. In Proceedings of USENIX, June 2003.
[16] H. Huang, K. Shin, C. Lefurgy, K. Rajamani, T. Keller, E. Van Hensbergen, and F. Rawson. Co-operative Software-Hardware Power Management for Main Memory. In Proceedings of PACS, December 2004.
[17] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas. A Framework for Dynamic Energy Efficiency and Temperature Management. In Proceedings of Micro, December 2000.
[18] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. In Proceedings of Micro, December 2006.
[19] Y. Kim, S. Gurumurthi, and A. Sivasubramaniam. Understanding the Performance-Temperature Interactions in Disk I/O of Server Workloads. In Proceedings of HPCA, February 2006.
[20] A. R. Lebeck, X. Fan, H. Zeng, and C. S. Ellis. Power Aware Page Allocation. In Proceedings of ASPLOS, November 2000.
[21] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. Keller. Energy Management for Commercial Servers. IEEE Computer, 36(12), December 2003.
[22] X. Li, Z. Li, F. David, P. Zhou, Y. Zhou, S. Adve, and S. Kumar. Performance-Directed Energy Management for Main Memory and Disks. In Proceedings of ASPLOS, October 2004.
[23] Y. Li, D. Brooks, Z. Hu, and K. Skadron. Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. In Proceedings of HPCA, February 2005.
[24] M. Lipasti, C. Wilkerson, and J. Shen. Value Locality and Load Value Prediction. In Proceedings of ASPLOS, October 1996.
[25] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. Wiley, 1990.
[26] MediaBench. http://cares.icsl.ucla.edu/MediaBench/.
[27] A. Merkel and F. Bellosa. Balancing Power Consumption in Multiprocessor Systems. In Proceedings of EuroSys 2006, April 2006.
[28] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making Scheduling "Cool": Temperature-Aware Resource Assignment in Data Centers. In Proceedings of USENIX, April 2005.
[29] V. Pandey, W. Jiang, Y. Zhou, and R. Bianchini. DMA-Aware Memory Energy Management. In Proceedings of HPCA, February 2006.
[30] M. Powell, M. Gomaa, and T. N. Vijaykumar. Heat-and-Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System. In Proceedings of ASPLOS, October 2004.
[31] Rambus. RDRAM. http://www.rambus.com.
[32] P. Ranganathan, P. Leech, D. Irwin, and J. Chase. Ensemble-Level Power Management for Dense Blade Servers. In Proceedings of ISCA, June 2006.
[33] E. Rohou and M. D. Smith. Dynamically Managing Processor Temperature and Power. In Proceedings of FDO, November 1999.
[34] Samsung. 512Mb E-die DDR2 SDRAM Specification. http://www.samsung.com/Products/Semiconductor/DDR_DDR2/DDR2SDRAM/Component/512Mbit/K4T51083QE/ds_k4t51xx3qe_rev14.pdf.
[35] L. Shang, L.-S. Peh, A. Kumar, and N. Jha. Characterization and Management of On-Chip Networks. In Proceedings of Micro, December 2004.
[36] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan. Temperature-Aware Microarchitecture. In Proceedings of ISCA, June 2003.
[37] J. Srinivasan, S. Adve, P. Bose, and J. Rivers. The Case for Lifetime Reliability-Aware Microprocessors. In Proceedings of ISCA, June 2004.
[38] Virtutech. Simics. http://www.simics.net.
[39] J. J. Yi, S. Kodakara, R. Sendag, D. J. Lilja, and D. M. Hawkins. Characterizing and Comparing Prevailing Simulation Techniques. In Proceedings of HPCA, February 2005.
[40] L. Zhang, Z. Fang, M. Parker, B. K. Mathew, L. Schaelicke, J. B. Carter, W. C. Hsieh, and S. A. McKee. The Impulse Memory Controller. IEEE Transactions on Computers, Special Issue on Advances in High-Performance Memory Systems, November 2001.
[41] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic Tracking of Page Miss Ratio Curve for Memory Management. In Proceedings of ASPLOS, October 2004.
[42] Q. Zhu and Y. Zhou. Power-Aware Storage Cache Management. IEEE Transactions on Computers, 54(5), 2005.