Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Aniruddha N. Udipi (University of Utah, Salt Lake City, UT; udipi@cs.utah.edu), Naveen Muralimanohar (Hewlett-Packard Laboratories, Palo Alto, CA; naveen.murali@hp.com), Niladrish Chatterjee (University of Utah, Salt Lake City, UT; nil@cs.utah.edu), Rajeev Balasubramonian (University of Utah, Salt Lake City, UT; rajeev@cs.utah.edu), Al Davis (University of Utah, Salt Lake City, UT; ald@cs.utah.edu), Norman P. Jouppi (Hewlett-Packard Laboratories, Palo Alto, CA; norm.jouppi@hp.com)

ABSTRACT

DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bitlines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day servers where operating costs can easily exceed the purchase cost. Modern technology trends are also placing very different demands on the memory system: (i) queuing delays are a significant component of memory access time, (ii) there is a high energy premium for the level of reliability expected for business-critical computing, and (iii) the memory access stream emerging from multi-core systems exhibits limited locality. All of these trends necessitate an overhaul of DRAM architecture, even if it means a slight compromise in the cost-per-bit metric.

This paper examines three primary innovations. The first is a modification to DRAM chip microarchitecture that retains the traditional DDRx SDRAM interface. Selective Bitline Activation (SBA) waits for both RAS (row address) and CAS (column address) signals to arrive before activating exactly those bitlines that provide the requested cache line. SBA reduces energy consumption while incurring slight area and performance penalties. The second innovation, Single Subarray Access (SSA), fundamentally re-organizes the layout of DRAM arrays and the mapping of data to these arrays so that an entire cache line is fetched from a single subarray. It requires a different interface to the memory controller, reduces dynamic and background energy (by about 6X and 5X), incurs a slight area penalty (4%), and can even lead to performance improvements (54% on average) by reducing queuing delays.
The third innovation further penalizes the cost-per-bit metric by adding a checksum feature to each cache line. This checksum error-detection feature can then be used to build stronger RAID-like fault tolerance, including chipkill-level reliability. Such a technique is especially crucial for the SSA architecture where the entire cache line is localized to a single chip. This DRAM chip microarchitectural change leads to a dramatic reduction in the energy and storage overheads for reliability. The proposed architectures will also apply to other emerging memory technologies (such as resistive memories) and will be less disruptive to standards, interfaces, and the design flow if they can be incorporated into first-generation designs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA'10, June 19-23, 2010, Saint-Malo, France. Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00.

Categories and Subject Descriptors: B.3.1 [Memory Structures]: Semiconductor Memories -- Dynamic memory (DRAM); B.3.2 [Memory Structures]: Design Styles -- Primary memory; B.8.1 [Performance and Reliability]: Reliability, Testing and Fault-Tolerance; C.5.5 [Computer System Implementation]: Servers

General Terms: Design, Performance, Reliability

Keywords: DRAM Architecture, Energy-efficiency, Locality, Chipkill, Subarrays

1. INTRODUCTION

The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the usage of mobile devices which access the web-based information infrastructure. It is expected that most CPU-intensive computing may either happen in servers housed in large datacenters, e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. In both situations, it is expected that the memory system will be problematic in terms of performance, reliability, and power consumption. The memory wall is not new: long DRAM memory latencies have always been a problem. Given that little can be done about the latency problem, DRAM vendors have chosen to optimize their designs for improved bandwidth, increased density, and minimum cost-per-bit. With these objectives in mind, a few DRAM architectures, standards, and interfaces were instituted in the 1990s and have persisted since then. However, the objectives in datacenter servers and HPC platforms of the future will be very different than those that are reasonable for personal computers, such as desktop machines. As a result, traditional DRAM architectures are highly inefficient from a future system perspective, and are in need of a major revamp. Consider the following technological trends that place very different demands on future DRAM architectures:

Energy: While energy was never a first-order design constraint in prior DRAM systems, it has certainly emerged as the primary constraint today, especially in datacenters. Energy efficiency in datacenters has already been highlighted as a national priority [50]. Many studies attribute 25-40% of total datacenter power to the DRAM system [11, 33, 34, 37]. Modern DRAM architectures are ill-suited for energy-efficient operation because they are designed to fetch much more data than required. This overfetch wastes dynamic energy. Today's DRAMs employ coarse-grained power-down tactics to reduce area and cost, but finer-grained approaches can further reduce background energy.

Reduced locality: Single-core workloads typically exhibit high locality. Consequently, current DRAMs fetch many kilobytes of data on every access and keep them in open row buffers so that subsequent requests to neighboring data elements can be serviced quickly. The high degree of multi-threading in future multi-cores [42] implies that memory requests from multiple access streams get multiplexed at the memory controller, thus destroying a large fraction of the available locality. The severity of this problem will increase with increased core and memory controller counts that are expected for future microprocessor chips. This trend is exacerbated by the increased use of aggregated memory pools ("memory blades" that are comprised of many commodity DIMMs) that serve several CPU sockets in an effort to increase resource utilization [34]. This mandates that future DRAM architectures place a lower priority on locality and a higher priority on parallelism.

Queuing Delays: For several years, queuing delays at the memory controller were relatively small because a single core typically had relatively few pending memory operations and DRAM systems were able to steeply increase peak memory bandwidth every year [20]. In the future, the number of pins per chip is expected to grow very slowly. The 2007 ITRS Roadmap [26] expects a 1.47x increase in the number of pins over an 8-year time-frame; over the same period, Moore's Law dictates at least a 16x increase in the number of cores. This implies that requests from many cores will be competing to utilize the limited pin bandwidth. Several studies have already highlighted the emergence of queuing delay as a major bottleneck [24, 31, 40, 41, 44, 56]. A DRAM architecture that is geared towards higher parallelism will likely be able to de-queue requests faster and better utilize the available limited data bandwidth.

Efficient Reliability: Recent studies have highlighted the need for DRAM architectures that are resilient to single faults or even failure within an entire DRAM chip [8, 46], especially in datacenter platforms. Because these fault-tolerant solutions are built upon commodity DRAM chips, they incur very high overheads in terms of energy and cost. New DRAM architectures can provide much more efficient reliability if fault-tolerant features are integrated into the DRAM chip microarchitecture at design time.

Lower relevance of DRAM chip area: DRAM vendors have long optimized the cost-per-bit metric. However, given that datacenters consume several billion kilowatt-hours of energy every year [50], it has been shown that the 3-year operating energy costs of today's datacenters equal the capital acquisition costs [33]. Therefore, it may now be acceptable to incur a slightly higher cost-per-bit when purchasing DRAM as long as it leads to significantly lower energy footprints during operation.

The design of DRAM devices specifically addressing these trends has, to the best of our knowledge, not been previously studied and is now more compelling than ever. We attempt to fundamentally rethink DRAM microarchitecture and organization to achieve highly reliable, high performance operation with extremely low energy footprints, all within acceptable area bounds. In this work, we propose two independent
designs, both attempting to activate the minimum circuitry required to read a single cache line. We make the following three significant contributions:

- We introduce and evaluate Posted RAS in combination with a Selective Bitline Activation (SBA) scheme. This entails a relatively simple change to DRAM microarchitecture, with only a minor change to the DRAM interface, to provide significant dynamic energy savings.
- We propose and evaluate a reorganization of DRAM chips and their interface, so that cache lines can be read via a Single Subarray Access (SSA) in a single DRAM chip. This approach trades off higher data transfer times for greater (dynamic and background) energy savings.
- In order to provide chipkill-level reliability [18, 35] even though we are reading a cache line out of a single DRAM device, we propose adding a checksum to each cache line in the SSA DRAM to provide error detection. We then evaluate the use of RAID techniques to reconstruct cache lines in the event of a chip failure.

While this study focuses on DRAM as an evaluation vehicle, the proposed architectures will likely apply just as well to other emerging storage technologies, such as phase change memory (PCM) and spin torque transfer RAM (STT-RAM).

2. BACKGROUND AND MOTIVATION

2.1 DRAM Basics and Baseline Organization

We first describe the typical modern DRAM architecture [27]. For most of the paper, our discussion will focus on the dominant DRAM architecture today: JEDEC-style DDRx SDRAM; an example is shown in Figure 1. Modern processors [45, 48, 54] often integrate memory controllers on the processor die. Each memory controller is connected to one or two dedicated off-chip memory channels. For JEDEC standard DRAM, the channel typically has a 64-bit data bus, a 17-bit row/column address bus, and an 8-bit command bus [38]. Multiple dual in-line memory modules (DIMMs) can be accessed via a single memory channel and memory controller. Each DIMM typically comprises multiple ranks, each rank consisting of a set of DRAM
Figure 1: An example DDRx SDRAM architecture with 1 DIMM, 2 ranks, and 8 x4 DRAM chips per rank.

chips. We will call this a rank-set. Exactly one rank-set is activated on every memory operation and this is the smallest number of chips that need to be activated to complete a read or write operation. Delays on the order of a few cycles are introduced when the memory controller switches between ranks to support electrical bus termination requirements. The proposed DRAM architecture is entirely focused on the DRAM chips, and has neither a positive nor negative effect on rank issues. Figure 1 shows an example DIMM with 16 total DRAM chips forming two rank-sets.

Each DRAM chip has an intrinsic word size which corresponds to the number of data I/O pins on the chip. An xN DRAM chip has a word size of N, where N refers to the number of bits going in/out of the chip on each clock tick. For a 64-bit data bus and x8 chips, a rank-set would require 8 DRAM chips (Figure 1 only shows 8 x4 chips per rank-set to simplify the figure). If the DIMM supports ECC, the data bus expands to 72 bits and the rank-set would consist of 9 x8 DRAM chips. When a rank is selected, all DRAM chips in the rank-set receive address and command signals from the memory controller on the corresponding shared buses. Each DRAM chip is connected to a subset of the data bus; of the 64-bit data packet being communicated on the bus on a clock edge, each x8 chip reads/writes an 8-bit subset.

A rank is itself partitioned into multiple banks, typically 4-16. Each bank can be concurrently processing a different memory request, thus affording a limited amount of memory parallelism. Each bank is distributed across the DRAM chips in a rank; the portion of a bank in each chip will be referred to as a sub-bank. The organization of a sub-bank will be described in the next paragraph. When the memory controller issues a request for a cache line, all the DRAM chips in the rank are activated and each sub-bank contributes a portion of the requested cache line. By striping a cache line across multiple DRAM chips, the available pin and channel bandwidth for the cache line transfer can be enhanced. If the data bus width is 64 bits and a cache line is 64 bytes, the cache line transfer happens in a burst of 8 data transfers.

If a chip is an xN part, each sub-bank is itself partitioned into N arrays (see Figure 1). Each array contributes a single bit to the N-bit transfer on the data I/O pins for that chip on a clock edge. An array has several rows and columns of single-bit DRAM cells. A cache line request starts with a RAS command that carries the subset of address bits that identify the bank and the row within that bank. Each array within that bank now reads out an entire row. The bits read out are saved in latches, referred to as the row buffer. The row is now considered opened. The page size or row buffer size is defined as the number of bits read out of all arrays involved in a bank access (usually 4-16 KB). Of these, only a cache line worth of data (identified by the CAS command and its associated subset of address bits) is communicated on the memory channel for each CPU request. Each bank has its own row buffer, so there can potentially be 4-16 open rows at any time. The banks can be accessed in parallel, but the data transfers have to be serialized over the shared data bus.

If the requested data is present in an open row (a row buffer hit), the memory controller is aware of this, and data can be returned much faster. If the requested data is not present in the bank's row buffer (a row buffer miss), the currently open row (if one exists) has to first be closed before opening the new row. To prevent the closing of the row from being on the critical path for the next row buffer miss, the controller may adopt a close-page policy that closes the row right after returning the requested cache line. Alternatively, an open-page policy keeps a row open until the bank receives a request for a different row.

As an example system, consider a 4 GB system, with two 2 GB ranks, each consisting of eight 256 MB x8, 4-bank devices, serving an L2 with a 64-byte cache line size. On every request from the L2 cache, each device has to provide 8 bytes of data. Each of the 4 banks in a 256 MB device is split into 8 arrays of 8 MB each. If there are 65,536 rows of 1024 columns of bits in each array, a row access brings down 1024 bits per array into the row buffer, giving a total row buffer size of 65,536 bits across 8 chips of 8 arrays each. The page size is therefore 65,536 bits (8 KBytes) and of these, only 64 bytes are finally returned to the processor, with each of the eight chips being responsible for 64 bits of the cache line. Such a baseline system usually significantly under-utilizes the bits it reads out (in the above example, only about 0.8% of the row buffer bits are utilized for a single cache line access) and ends up unnecessarily activating various circuits across the rank-set.

2.2 Motivational Data

Recent studies have indicated the high energy needs of datacenters [50] and that memory contributes up to 40% of total server power consumption [11, 33, 37]. We start with a workload characterization on our simulation infrastructure (methodology details in Section 4.1). Figure 2 shows the trend of steeply dropping row-buffer hit rates as the number of threads simultaneously accessing memory goes up. We see average rates drop from over 60% for a 1-core system to 35% for a 16-core system. We also see that whenever a row is fetched into the row-buffer, the number of times it is used before being closed due to a conflict is often just one or two (Figure 3). This indicates that even on benchmarks with high locality and good average row buffer hit rates (for example, cg), a large number of pages still don't have much reuse in the row-buffer. These trends have also been observed in prior work on Micro-Pages [47]. This means that the energy costs of activating an entire 8 KB row is amortized over very few accesses, wasting significant energy.

Figure 2: Row buffer hit rate trend
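The under-utilization figure quoted in the Section 2.1 example is easy to verify with a few lines of arithmetic (the geometry below is exactly the example system's):

```python
# Baseline example: 8 chips per rank-set, 8 arrays per x8 chip,
# 1024 bits brought into the row buffer per array per activation.
chips_per_rank = 8
arrays_per_chip = 8
bits_per_array_row = 1024

page_bits = chips_per_rank * arrays_per_chip * bits_per_array_row
cache_line_bits = 64 * 8          # one 64-byte cache line

utilization = cache_line_bits / page_bits
print(page_bits)                  # 65536 bits = 8 KB page
print(f"{utilization:.2%}")       # 0.78% of the row buffer actually used
```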
Figure 3: Row use count for 8 cores

3. PROPOSED ARCHITECTURE

We start with the premise that the traditional row-buffer locality assumption is no longer valid, and try to find an energy-optimal DRAM design with minimal impacts on area and latency. Our first novel design (Selective Bitline Activation, Section 3.1) requires minor changes to DRAM chip microarchitecture, but is compatible with existing DRAM standards and interfaces. The second novel design (Single Subarray Access, Section 3.2) requires non-trivial changes to DRAM chip microarchitecture and its interface to the memory controller. Section 3.3 describes our novel chipkill solution for the proposed architecture.

3.1 Selective Bitline Activation (SBA)

In an effort to mitigate the overfetch problem with minimal disruption to existing designs and standards, we propose the following two simple modifications: (i) we activate a much smaller segment of the wordline and (ii) we activate only those bitlines corresponding to the requested cache line. Note that we will still need a wire spanning the array to identify the exact segment of wordline that needs to be activated, but this is very lightly loaded and therefore has low delay and energy. Thus, we are not changing the way data gets laid out across DRAM chip arrays, but every access only brings down the relevant cache line into the row buffer. As a result, the notion of an open-page policy is now meaningless. After every access, the cache line is immediately written back. Most of the performance difference from this innovation is because of the shift to a close-page policy: for workloads with little locality, this can actually result in performance improvements as the page precharge after write-back is taken off the critical path of the subsequent row buffer miss. Next, we discuss the microarchitectural modifications in more detail.
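To make the two modifications concrete, the sketch below (our illustration, not the paper's circuit) models a subset of the column address driving a region-select signal so that only one sub-wordline segment, and hence only that segment's bitlines, is activated. The geometry reuses the Section 2.1 example; the 8-bit per-array segment size is an assumption for illustration:

```python
# Sketch of SBA's selective activation (illustrative geometry, not the
# paper's exact part): a 1024-bit array row is divided into segments,
# and the column address selects exactly one segment to activate.
ROW_BITS = 1024          # bits per array row (Section 2.1 example)
SEGMENT_BITS = 8         # assumed bits per array for one cache line
NUM_SEGMENTS = ROW_BITS // SEGMENT_BITS   # 128 region-select choices

def bits_activated(column_addr: int, selective: bool) -> int:
    """Bitlines activated in one array for a single access."""
    if not selective:
        return ROW_BITS                   # baseline: the whole row
    segment = column_addr % NUM_SEGMENTS  # CAS subset drives region select
    assert 0 <= segment < NUM_SEGMENTS
    return SEGMENT_BITS                   # only the selected segment

print(bits_activated(37, selective=False))  # 1024
print(bits_activated(37, selective=True))   # 8 -> 128x fewer bitlines
```

The segment size is a direct area/energy trade-off: finer segments save more bitline energy but need more region-select circuitry per cell, which Section 3.1 quantifies.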
Figure 4: Hierarchical wordline with region select. (Figure courtesy VLSI Memory Chip Design, K. Itoh.)

Memory systems have traditionally multiplexed RAS and CAS commands on the same I/O lines due to pin count limitations. This situation is unlikely to change due to technological limitations [26] and is a hard constraint for DRAM optimization. In a traditional design, once the RAS arrives, enough information is available to activate the appropriate wordline within the array. The cells in that row place their data on the corresponding bitlines. Once the row's data is latched into the row buffer, the CAS signal is used to return some fraction of the many bits read from that array. In our proposed design, instead of letting the RAS immediately activate the entire row and all the bitlines, we wait until the CAS has arrived to begin the array access. The CAS bits identify the subset of the row that needs to be activated and the wordline is only driven in that section. Correspondingly, only those bitlines place data in the row buffer, saving the activation energy of the remaining bits. Therefore, we need the RAS and the CAS before starting the array access. Since the RAS arrives early, it must be stored in a register until the CAS arrives. We refer to this process as Posted-RAS(1). Because we are now waiting for the CAS to begin the array access, some additional cycles (on the order of 10 CPU cycles) are added to the DRAM latency. We expect this impact (quantified in Section 4) to be relatively minor because of the hundreds of cycles already incurred on every DRAM access. Note again that this change is compatible with existing JEDEC standards: the memory controller issues the same set of commands; we simply save the RAS in a register until the CAS arrives before beginning the array access.

The selective bitline activation is made possible by only activating a small segment of the wordline. We employ hierarchical wordlines to facilitate this, at some area cost. Each wordline consists of a Main Wordline (MWL), typically run in first-level metal, controlling Sub-Wordlines (SWL), typically run in poly, which actually connect to the memory cells (see Figure 4). The MWL is loaded only by a few "AND" gates that enable the sub-wordlines, significantly reducing its capacitance, and therefore its delay. "Region Select (RX)" signals control activation of specific SWLs.

Hierarchical wordlines have been previously proposed for DRAMs [25] to reduce delay (rather than energy). Until now, other techniques (metal shunted wordlines [28], for instance) partially achieved what has been perceived as the advantage of hierarchical wordlines: significant reductions in wordline delay. In a shunted wordline, a metal wordline is stitched to the low-pitch poly wordline at regular intervals by metal-poly contacts. This reduces the wordline delay by limiting the high resistance poly to a small distance while saving area by having only a few metal-poly contacts. The increased area costs of hierarchical wordlines have therefore not been justifiable thus far. Now, with the increasing importance of energy considerations, we believe that using hierarchical wordlines is not only acceptable, but actually necessary. Note that wordlines do not contribute as much to overall DRAM energy, so this feature is important not for its wordline energy savings, but because it enables selective bitline activation. In our proposed design, a subset of the CAS address is used to trigger the RX signal, reducing the activation area and wordline/bitline energy. Note that since the MWL is not directly connected to the memory cells, the activation of the MWL across the array does not result in destruction of data, since only the small subset of cells connected to the active SWL read their data out.

We incorporated an analytical model for hierarchical wordlines into CACTI 6.5 [39, 49] (more details in Section 4.1) to quantify the area overhead. For the specific DRAM part described in Section 4.1, we observed that an area overhead of 100% was incurred when enough SWLs were introduced to activate exactly one cache line in a bank. This is because of the high area overhead introduced by the AND gate and RX signals for a few memory cells. While this results in activating a minimum number of bitlines, the cost may be prohibitive. However, we can trade off energy for lower cost by not being as selective. If we were to instead read out 16 cache lines, the SWLs become 16 times longer. This still leads to high energy savings over the baseline, and a more acceptable area overhead of 12%. Most of our results in Section 4 pertain to this model. Even though we are reading out 16 cache lines, we continue to use the close-page policy.

In summary, the SBA mechanism (i) reduces bitline and wordline dynamic energy by reading out a limited number of cache lines from the arrays (to significantly reduce overfetch), (ii) impacts performance (negatively or positively) by using a close-page policy, (iii) negatively impacts performance by waiting for CAS before starting array access, (iv) increases area and cost by requiring hierarchical wordlines, and finally (v) does not impact the DRAM interface. As we will discuss subsequently, this mechanism does not impact any chipkill solutions for the DRAM system because the data organization across the chips has not been changed.

(1) Many memory controllers introduce a gap between the issue of the RAS and CAS so that the CAS arrives just as the row buffer is being populated and the device's Trcd constraint is satisfied [27]. Some memory systems send the CAS immediately after the RAS. The CAS is then saved in a register at the DRAM chip until the row buffer is ready. This is referred to as Posted-CAS [29]. We refer to our scheme as Posted-RAS because the RAS is saved in a register until the arrival of the CAS.

3.2 Single Subarray Access (SSA)

While the SBA design can eliminate overfetch, it is still an attempt to shoehorn in energy optimizations in a manner that conforms to modern-day DRAM interfaces and data layouts. Given that we have reached an inflection point, a major rethink of DRAM design is called for. An energy-efficient architecture will also be relevant for other emerging storage technologies. This sub-section defines an energy-optimized architecture (SSA) that is not encumbered by existing standards.

Many features in current DRAMs have contributed to better locality handling and low cost-per-bit, but also to high energy overhead. Arrays are designed to be large structures so that the peripheral circuitry is better amortized. While DRAMs can allow low-power sleep modes for arrays, the
large size of each array implies that the power-down granularity is rather coarse, offering fewer power-saving opportunities. Since each DRAM chip has limited pin bandwidth, a cache line is striped across all the DRAM chips on a DIMM to reduce the data transfer time (and also to improve reliability). As a result, a single access activates multiple chips, and multiple large arrays within each chip.

Figure 5: SSA DRAM Architecture.

Overview: To overcome the above drawbacks and minimize energy, we move to an extreme model where an entire cache line is read out of a single small array in a single DRAM chip. This small array is henceforth referred to as a "subarray". Figure 5 shows the entire memory channel and how various sub-components are organized. The subarray is as wide as a cache line. Similar to SBA, we see a dramatic reduction in dynamic energy by only activating enough bitlines to read out a single cache line. Further, remaining inactive subarrays can be placed in low-power sleep modes, saving background energy. The area overhead of SSA is lower than that of SBA since we divide the DRAM array at a much coarser granularity. If the DRAM chip is an x8 part, we either need to provide 8 wires from each subarray to the I/O pins or provide a single wire and serialize the transfer. We adopt the former option and as shown in Figure 5, the subarrays place their data on a shared 8-bit bus. In addition, since the entire cache line is being returned via the limited pins on a single chip, it takes many more cycles to effect the data transfer to the CPU. Thus, the new design clearly incurs a higher DRAM latency because of slow data transfer rates. It also only supports a close-page policy, which can impact performance either positively or negatively. On the other hand, the design has much higher concurrency, as each DRAM chip can be simultaneously servicing a different cache line. Since each chip can implement several independent subarrays, there can also be much higher intra-chip or bank-level concurrency. We next examine our new design in greater detail.

Memory Controller Interface: Just as in the baseline, a single address/command bus is used to communicate with all DRAM chips on the DIMM. The address is provided in two transfers because of pin limitations on each DRAM chip. This is similar to RAS and CAS in a conventional DRAM, except that they need not be called as such (there isn't a column-select in our design). The address bits from both transfers identify a unique subarray and row (cache line) within that subarray. Part of the address now identifies the DRAM chip that has the cache line (not required in conventional DRAM because all chips are activated). The entire address is required before the subarray can be identified or accessed. Similar to the SBA technique, a few more cycles are added to the DRAM access latency. An additional requirement is that every device has to be capable of latching commands as they are received, to enable the command bus to then move on to operating a different device. This can easily be achieved by having a set of registers (each capable of signaling one device) connected to a demultiplexer which reads commands off the command bus and redirects them appropriately. The data bus is physically no different than the conventional design: for an xN DRAM chip, N data bits are communicated between the DRAM chip and the memory controller every bus cycle. Logically, the N bits from every DRAM chip on a DIMM rank were part of the same cache line in the conventional design; now they are completely independent and deal with different cache lines. Therefore, it is almost as if there are eight independent narrow channels to this DIMM, with the caveat that they all share a single address/command bus.

Subarray Organization: The height of each subarray (i.e., the number of cache lines in a given subarray) directly determines the delay/energy per access within the subarray. Many small subarrays also increase the potential for parallelism and low-power modes. However, a large number of subarrays implies a more complex on-die network and more energy and delay within this network. It also entails greater overhead from peripheral circuitry (decoders, drivers, sense amps, etc.) per subarray which directly impacts area and cost-per-bit. These are basic trade-offs considered during DRAM design and even incorporated into analytical cache models such as CACTI 6.5 [39, 49]. Figure 5 shows how a number of subarrays in a column share a row buffer that feeds the shared bus. The subarrays sharing a row buffer are referred to as a bank, and similar to the conventional model, a single bank can only be dealing with one access at a time. Our SSA implementation models hierarchical bitlines in which data read from a subarray are sent to the row buffer through second-level bitlines. To distribute load and maximize concurrency, data is interleaved such that consecutive cache lines are first placed in different DRAM chips and then in different banks of the same chip. To limit the impact on area and interconnect overheads, if we assume the same number of banks per DRAM chip as the baseline, we still end up with a much higher number of total banks on the DIMM. This is because in the baseline organization, the physical banks on all the chips are simply parts of larger logical banks. In the SSA design, each physical bank is independent and a much higher degree of concurrency is offered. Our analysis with a heavily extended version of CACTI 6.5 showed that the area overhead of SSA is only 4%.

Since subarray widths are only 64 bytes, sequential refresh at this granularity will be more time-consuming. However, it is fairly easy to refresh multiple banks simultaneously, i.e., they simply act as one large bank for refresh purposes. In addition, there exist simple techniques to perform refresh that keep the DRAM cell's access transistor on long enough to recharge the storage capacitor immediately after a destructive read, without involving the row-buffer [27].

Power-Down Modes: In the SSA architecture, a cache line request is serviced by a single bank in a single DRAM chip, and only a single subarray within that bank is activated. Since the activation "footprint" of the access is much smaller in the SSA design than in the baseline, there is the opportunity to power down a large portion of the remaining area that may enjoy longer spells of inactivity. Datasheets from Micron [38] indicate that modern chips already support multiple power-down modes that disable various circuitry like the input and output buffers or even freeze the DLL. These modes do not destroy the data on the chip and the chip can be reactivated with a latency penalty proportional to the amount of circuitry that has been turned off and the depth of the power-down state. We adopt a simple strategy for power-down: if a subarray has been idle for I cycles, it goes into a power-down mode that consumes P times less background power than the active mode. When a request is later sent to this subarray, a W cycle latency penalty is incurred for wake-up. The results section quantifies the performance and power impact for various values of I, P, and W.

Impact Summary: In summary, the proposed organization targets dynamic energy reduction by only activating a single chip and a single subarray (with short wordlines and exactly the required number of bitlines) when accessing a cache line. Area overhead is increased, compared to conventional DRAM, because each small subarray incurs the overhead of peripheral circuitry and because a slightly more complex on-die interconnect is required. Background energy can be reduced because a large fraction of the on-chip real estate is inactive at any point and can be placed in low-power modes. The interface between the memory controller and DRAM chips has been changed by effectively splitting the channel into multiple smaller width channels. The impact on reliability is discussed in the next sub-section. Performance is impacted favorably by having many more banks per DIMM and higher concurrency. Similar to the baseline, if we assume that each chip has eight banks, the entire DIMM now has 64 banks. Performance may be impacted positively or negatively by adopting a close-page policy. Performance is negatively impacted because the cache line is returned to the memory controller via several serialized data transfers (an x8 part will take 64 transfers to return a 64-byte cache line). A negative impact is also incurred because the subarray access can begin only after the entire address is received.

We believe that SSA is superior to SBA, although it requires a larger re-design investment from the DRAM community. Firstly, in order to limit the area overhead of hierarchical wordlines, SBA is forced to fetch multiple cache lines, thus not completely eliminating overfetch. SSA therefore yields higher dynamic energy savings. By moving from large arrays in SBA to small subarrays in SSA, SSA also finds many more opportunities to place subarrays in low-power states and save leakage energy. In terms of performance, SSA is hurt by the long data transfer time, and will outdo SBA in workloads that have a high potential for bank-level concurrency.

3.3 Chipkill

Recent studies have shown that DRAMs are often plagued with errors and can lead to significant server downtime in datacenters [46]. Therefore, a low-power DRAM design targeted at datacenters must be amenable
to an architecture that provides a high standard of reliability. A common expectation of business-critical server DRAM systems is that they are able to withstand a single DRAM chip failure. Just as an entire family of error-resilient schemes can be built for bit failures (for example, Single Error Correction Double Error Detection, SECDED), a family of error-resilient schemes can also be built for chip failure (for example, Single Chip error Correction Double Chip error Detection, SCCDCD), and these are referred to as Chipkill [18, 35]. We now focus on the design of an SCCDCD chipkill scheme; the technique can be easily generalized to produce stronger flavors of error-resilience.

First, consider a conventional design where each word (say 64 bits) has been appended with an 8-bit ECC code, to provide SECDED. For a chipkill scheme, each DRAM chip can only contribute one bit out of the 72-bit word. If a chip were to contribute any more, chip failure would mean multi-bit corruption within the 72-bit word, an error that a SECDED code cannot recover from. Therefore, each 72-bit word must be striped across 72 DRAM chips. When a 64-byte cache line is requested, 72 bytes are read out of the 72 DRAM chips, making sure that each 72-bit word obtains only a single bit from each DRAM chip. Such an organization was adopted in the Dell Poweredge 6400/6450 servers [35]. This provides some of the rationale for current DRAM systems that stripe a cache line across several DRAM chips. This is clearly energy inefficient as 72 DRAM chips are activated and a very small fraction of the read bits are returned to the CPU. It is possible to reduce the number of DRAM chips activated per access if we attach ECC codes to smaller words as has been done in the IBM Netfinity systems [18]. This will have high storage overhead, but greater energy efficiency. For example, in a design attaching an ECC word to 8 bits, say, one may need five extra DRAM chips per eight DRAM chips on a single DIMM. ECC gets progressively more efficient as the granularity at which it is attached is increased.

In the SSA design, we intend to get the entire cache line from a single DRAM chip access. If this DRAM chip were to produce corrupted data, there must be a way to reconstruct it. This is a problem formulation almost exactly the same as that for reliable disks. We therefore adopt a solution very similar to that of the well-studied RAID [20] solution for disks, but that has never been previously employed within a DIMM. Note that some current server systems do employ RAID-like schemes across DIMMs [2, 6]; within a DIMM, conventional ECC with an extra DRAM chip is employed. These suffer from high energy overheads due to the large number of chips accessed on every read or write. Our approach is distinct and more energy-efficient. In an example RAID design, a single disk serves as the "parity" disk to eight other data disks. On a disk access (specifically in RAID-4 and RAID-5), only a single disk is read. A checksum associated with the read block (and stored with the data block on the disk) lets the RAID controller know if the read is correct or not. If there is an error, the RAID controller reconstructs the corrupted block by reading the other seven data disks and the parity disk. In the common error-free case, only one disk needs to be accessed because the checksum enables self-contained error detection. It is not fool-proof because the block+checksum may be corrupted, and the checksum may coincidentally be correct (the larger the checksum, the lower the probability of such a silent data corruption). Also, the parity overhead can be made arbitrarily low by having one parity disk for many data disks. This is still good enough for error detection and recovery because the checksum has already played the role of detecting and identifying the corrupted bits. The catch is that writes are more expensive as every write requires a read of the old data block, a read of the old parity block, a write to the data block, and a write to the parity block. RAID-5 ensures that the parity blocks are distributed among all nine disks so that no one disk emerges as a write bottleneck.

Figure 6: Chipkill support in SSA (only shown for 64 cache lines). L = Cache Line, C = Local Checksum, P = Global Parity.

We adopt the same RAID-5 approach in our DRAM SSA design (Fig. 6). The DRAM array microarchitecture must now be modified to not only accommodate a cache line, but also its associated checksum. We assume an eight-bit checksum, resulting in a storage overhead of 1.625% for a 64-byte cache line. The checksum function uses bit inversion so that stuck-at-zero faults do not go undetected. The checksum is returned to the CPU after the cache line return and the verification happens in the memory controller (a larger burst length is required, not additional DRAM pins). We cannot allow the verification to happen at the DRAM chip because a corrupted chip may simply flag all accesses as successfully passing the checksum test. The DIMM will now have one extra DRAM chip, a storage overhead of 12.5% for our evaluated platform. Most reads only require that one DRAM chip be accessed. A write requires that two DRAM chips be read and then written. This is the primary performance overhead of this scheme as it increases bank contention (note that an increase in write latency does not impact performance because of read-bypassing at intermediate buffers at the memory controller). We quantify this effect in the results section. This also increases energy consumption, but it is still far less than the energy of reliable or non-reliable conventional DRAM systems.

Most chipkill-level reliability solutions have a higher storage overhead than our technique. As described above, the energy-efficient solutions can have as high as 62.5% overhead, the Dell Poweredge solution has a 12.5% overhead (but requires simultaneous access to 72 DRAM chips), and the rank-sub-setting DRAM model of Ahn et al. [8] has a 37.5% overhead. The key to our higher efficiency is the localization of an entire cache line to a single DRAM chip and the use of checksum for self-contained error detection at modest overhead (1.625%) plus a parity chip (1
2.5%for8-wayparity).Evenonwrites,whenfourDRAMaccessesarerequired,wetouchfewerDRAMchipsandreadonlyasinglecachelineineach,comparedtoanyofthepriorso-lutionsforchipkill[8,18,35].Therefore,ourproposedSSAarchitecturewithchipkillfunctionalityisbetterthanothersolutionsintermsofareacostandenergy.AsweshowinSection4,theperformanceimpactofwritecontentionisalsolowbecauseofthehighdegreeofbankconcurrencyaordedbySSA. Processor 8-CoreOOO,2GHz L1cache FullyPrivate,3cycle 2-way,32KBeachIandD L2cache Fullyshared,10cycle 8-way,2MB,64BCachelines Row-buersize 8KB DRAMFrequency 400MHz DRAMPart 256MB,x8 ChipsperDIMM 16 Channels 1 Ranks 2 Banks 4 T-rcd,T-cas,T-rp 5DRAMcyc Table1:Generalparameters4.RESULTS4.1MethodologyWemodelabaseline,8-core,out-of-orderprocessorwithprivateL1cachesandasharedL2cache.Weassumeamainmemorycapacityof4GBorganizedasshowninTable1.OursimulationinfrastructureusesVirtutech'sSIMICS[5]full-systemsimulator,without-of-ordertimingsupportedbySimics'`ooo-micro-arch'module.The`trans-staller'mod-ulewasheavilymodiedtoaccuratelycaptureDRAMde-vicetiminginformationincludingmultiplechannels,ranks,banksandopenrowsineachbank.Bothopen-andclose-rowpagemanagementpolicieswithrst-come-rst-serve(FCFS)andrst-ready-rst-come-rst-serve(FR-FCFS)schedulingwithappropriatequeuingdelaysareaccuratelymodeled.Wealsomodeloverlappedprocessingofcommandsbythemem-orycontrollertohideprechargeandactivationdelayswhenpossible.Wealsoincludeaccuratebusmodelsfordatatrans-ferbetweenthememorycontrollerandtheDIMMs.AddressmappingpolicieswereadoptedfromtheDRAMSim[52]frameworkandfrom[27].DRAMtiminginformationwasobtainedfromMicrondatasheets[38].Area,latencyandenergynumbersforDRAMbankswereobtainedfromCACTI6.5[1],heavilymodiedtoincludeac-curatemodelsforcommodityDRAM,bothforthebaselinedesignandwithhierarchicalwordlines.Bydefault,CACTIdividesalargeDRAMarrayintoanumberofmatswithanH-treetoconnectthemats.Suchanorganizationincurslowlatencybutrequireslargearea.However,traditionalDRAMbanksareheavilyoptimizedforareatoreducecostandem
ployverylargearrayswithminimalperipheralcir-cuitryoverhead.Readorwriteoperationsaretypicallydoneusinglongmulti-levelhierarchicalbitlinesspanningthear-rayinsteadofusinganH-treeinterconnect.WemodiedCACTItore\rectsuchacommodityDRAMimplementa-tion.Notethatwithahierarchicalbitlineimplementation,thereisapotentialopportunitytotrade-obitlineenergyforareabyonlyusinghierarchicalwordlinesatthehigher-levelbitlineandleavingtherst-levelbitlinesuntouched.Inthiswork,wedonotexplorethistrade-o.Instead,wefocusonthemaximumenergyreductionpossible.TheDRAMen-ergyparametersusedinourevaluationarelistedinTable2.Weevaluateourproposalsonsubsetsofthemulti-threadedPARSEC[13],NAS[9]andSTREAM[4]benchmarksuites.Weruneveryapplicationfor2millionDRAMaccesses(cor-respondingtomanyhundredsofmillionsofinstructions)andreporttotalenergyconsumptionandIPC. Component Dynamic Energy(nJ) Decoder+Wordline +Senseamps-Baseline 1.429 Decoder+Wordline +Senseamps-SBA 0.024 Decoder+Wordline +Senseamps-SSA 0.013 Bitlines-Baseline 19.282 Bitlines-SBA/SSA 0.151 TerminationResistors Baseline/SBA/SSA 7.323 OutputDrivers 2.185 GlobalInterconnect Baseline/SBA/SSA 1.143 Low-powermode Background Power(mW) Active 104.5 PowerDown(3mem.cyc) 19.0 SelfRefresh(200mem.cyc) 10.8 
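Before turning to results, the chipkill read and write paths described above can be made concrete with a short sketch. This is an illustrative model, not the authors' implementation: the 8-bit XOR-with-inversion checksum and all names below are our own stand-ins, chosen only to preserve the properties the text requires (self-contained detection on reads, bit inversion against stuck-at-zero faults, and a RAID-5-style two-chip write).

```python
# Illustrative sketch of SSA chipkill: a per-line checksum gives self-contained
# error detection on reads; a parity chip enables RAID-5-style reconstruction.
# The XOR-based checksum is a stand-in for the paper's (unspecified) checksum.

def checksum(line: bytes) -> int:
    acc = 0
    for b in line:
        acc ^= b
    return acc ^ 0xFF  # bit inversion: an all-zero (stuck-at-zero) line fails

def xor_lines(lines) -> bytes:
    out = bytearray(len(lines[0]))
    for ln in lines:
        for i, b in enumerate(ln):
            out[i] ^= b
    return bytes(out)

def read_line(chips, parity, sums, idx):
    """Common case: touch one data chip, verify at the memory controller.
    On a mismatch, reconstruct from the surviving chips plus parity."""
    data = chips[idx]
    if checksum(data) == sums[idx]:
        return data                      # error-free: a single chip access
    survivors = [c for i, c in enumerate(chips) if i != idx]
    return xor_lines(survivors + [parity])

def write_line(chips, parity_box, sums, idx, new_data):
    """RAID-5-style write: read old data and old parity, write new data and
    updated parity -- two chips touched, four accesses total."""
    old_data, old_parity = chips[idx], parity_box[0]
    parity_box[0] = bytes(p ^ o ^ n
                          for p, o, n in zip(old_parity, old_data, new_data))
    chips[idx] = new_data
    sums[idx] = checksum(new_data)
```

With eight data chips and one parity chip, the parity storage cost matches the 12.5% overhead quoted above, and the incremental parity update is what limits a write to two chips rather than a read of all nine.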
4.2 Results

We first discuss the energy advantage of the SBA and SSA schemes. We then evaluate the performance characteristics and area overheads of the proposed schemes relative to the baseline organization.

4.2.1 Energy Characteristics

Figure 7 shows the energy consumption of the close-page baseline, SBA, and SSA, normalized to the open-page baseline. The close-page baseline is clearly worse in terms of energy consumption than the open-page baseline, simply because even accesses that were potentially row-buffer hits (thus not incurring the energy of activating the entire row again) now need to go through the entire activate-read-precharge cycle. We see an average increase in energy consumption of 73%, with individual benchmark behavior varying based on their respective row-buffer hit rates. We see from Figure 8 (an average across all benchmarks) that in the baseline organizations (both open and close row), the total energy consumption in the device is dominated by energy in the bitlines. This is because every access to a new row results in a large number of bitlines getting activated twice, once to read data out of the cells into the row-buffer and once to precharge the array.

Moving to the SBA or SSA schemes eliminates a huge portion of this energy component. By waiting for the CAS signal and only activating/precharging the exact cache line that we need, bitline energy goes down by a factor of 128. This results in a dramatic energy reduction on every access. However, as discussed previously, prohibitive area overheads necessitate coarser-grained selection in SBA, leading to slightly larger energy consumption compared to SSA. Compared to a baseline open-page system, we see average dynamic memory energy savings of 3X in SBA and over 6.4X in SSA.

Note that the proposed optimizations result in energy reduction only in the bitlines. The energy overhead due to other components such as the decoder, pre-decoder, inter-bank bus, bus termination, etc. remains the same. Hence, their contribution to the total energy increases as bitline energy goes down.

Figure 7: DRAM dynamic energy consumption (baseline open row, baseline close row, SBA, SSA).

Figure 8: Contributors to DRAM dynamic energy (termination resistors, global interconnect, bitlines, decoder/wordline/senseamps).

Localizing and managing DRAM accesses at a granularity as fine as a subarray allows more opportunity to put larger parts of the DRAM into low-power states. Current DRAM devices support multiple levels of power-down, with different levels of circuitry being turned off and correspondingly larger wake-up penalties. We evaluate two simple low-power modes with P (power savings factor) and W (wakeup time) values calculated based on the numbers shown in Table 2, obtained from the Micron datasheet and power system calculator [3, 38]. In the deepest sleep mode, Self Refresh, P is 10 and W is 200 memory cycles. A less deep sleep mode is Power Down, where P is 5.5 but W is just 3 memory cycles. We vary I (idle cycle threshold) as multiples of the wake-up time W. Figures 9 and 10 show the impact of these low-power states on performance and energy consumption in the SSA organization. We see that the more expensive Self Refresh low-power mode actually buys us much lower energy savings compared to the more efficient Power Down mode. As we become less aggressive in transitioning to low-power states (increase I), the average memory latency penalty goes down, from just over 5% to just over 2% for the Power Down mode. The percentage of time we can put subarrays in low-power mode correspondingly changes from almost 99% to about 86%, with energy savings between 81% and 70%. The performance impacts are much larger for the expensive Self Refresh mode, going from over 400% at a very aggressive I to under 20% in the least aggressive case. Correspondingly, banks can be put in this state between 95% and 20% of the time, with energy savings ranging from 85% to 20%.

Figure 9: Memory latency impact of using low-power states (Self Refresh vs. Power Down, as a function of idle threshold, in multiples of wakeup time).

Figure 10: Energy reduction using low-power states (Self Refresh vs. Power Down, as a function of idle threshold, in multiples of wakeup time).

Naturally, these power-down modes can be applied to the baseline architecture as well. However, the granularity at which this can be done is much coarser, a DIMM bank at best. This means that there are fewer opportunities to move into low-power states. As a comparison, we study the application of the low-overhead Power Down state to the baseline. We find that on average, even with an aggressive sleep threshold, banks can only be put in this mode about 80% of the time, while incurring a penalty of 16% in terms of added memory latency. Being less aggressive dramatically impacts the ability to power down the baseline, with banks going into sleep mode only 17% of the time with a minimal 3% latency penalty. As another comparison point, we consider the percentage of time subarrays or banks can be put in the deepest sleep Self Refresh mode in SSA vs. the baseline, for a constant 10% latency overhead. We find that subarrays in SSA can go into deep sleep nearly 18% of the time, whereas banks in the baseline can only go into deep sleep about 5% of the time.

4.2.2 Performance Characteristics

Employing either the SBA or SSA scheme impacts memory access latency (positively or negatively), as shown in Figure 11. Figure 12 then breaks this latency down into the average contributions of the various components. One of the primary factors affecting this latency is the page management policy. Moving to a close-page policy from an open-page baseline actually results in a drop in average memory latency of about 17% for a majority (10 of 12) of our benchmarks. This has favorable implications for SBA and SSA, which must use a close-page policy. The remaining benchmarks see an increase in memory latency of about 28% on average when moving to close-page. Employing the "Posted-RAS" scheme in the SBA model causes a small additional latency of just over 10% on average (neglecting two outliers). As seen in Figure 12, for these three models, the queuing delay is the dominant contributor to total memory access latency. Prior work [15] has also shown this to be true in many DRAM systems. We therefore see that the additional latency introduced by the "Posted-RAS" does not significantly change average memory access latency.

Figure 11: Average main memory latency in cycles (baseline open page, baseline close page, SBA, SSA).

Figure 12: Contributors to total memory latency (queuing delay, command/address transfer, rank switching delay (ODT), DRAM core access, data transfer).

The SSA scheme, however, has an entirely different bottleneck. Every cache line return is now serialized over just 8 links to the memory controller. This data transfer delay now becomes the dominant factor in the total access time. However, this is offset to some extent by a large increase in parallelism in the system. Each of the 8 devices can now be servicing independent sets of requests, significantly reducing the queuing delay. As a result, we do not see a greatly increased memory latency. On half of our benchmarks, we see latency increases of just under 40%. The other benchmarks are actually able to exploit the parallelism much better, and this more than compensates for the serialization latency, with average access time going down by about 30%. These are also the applications with the highest memory latencies. As a result, overall, SSA in fact outperforms all other models.

Figure 13 shows the relative IPCs of the various schemes under consideration. As we saw for the memory latency numbers, a majority of our benchmarks perform better with a close-row policy than with an open-row policy. We see performance improvements of just under 10% on average (neglecting two outliers) for 9 of our 12 benchmarks. The other three suffered degradations of about 26% on average. These were the benchmarks with relatively higher last-level cache miss rates (on the order of 10 every 1000 instructions). Employing the "Posted-RAS" results in a marginal IPC degradation over the close-row baseline, about 4% on average, neglecting two outlier benchmarks.

Figure 13: Normalized IPCs of various organizations (baseline open page, baseline close page, SBA, SSA, SSA chipkill).

The SSA scheme sees a performance degradation of 13% on average compared to the open-page baseline on the six benchmarks that saw a memory latency increase. The other 6 benchmarks, with a decreased memory access latency, see performance gains of 54% on average. These high numbers are observed because these applications are clearly limited by bank contention and SSA addresses this bottleneck. To summarize, in addition to significantly lowered DRAM access energies, SSA can occasionally boost performance, while yielding minor performance slowdowns for others. We expect SSA to yield even higher improvements in the future as ever more cores exert higher queuing pressure on memory controllers. Figure 13 also shows the IPC degradation caused when we augment SSA with our chipkill solution. Note that this is entirely because of the increased bank contention during writes. On average, the increase in memory latency is a little over 70%, resulting in a 12% degradation in IPC. Compared to the non-chipkill SSA, there is also additional energy consumption on every write, resulting in a 2.2X increase in dynamic energy to provide chipkill-level reliability, which is still significantly lower than a baseline organization.

4.2.3 System Level Characteristics

To evaluate the system-level impact of our schemes, we use a simple model where the DRAM subsystem consumes 40% of total system power (32% dynamic and 8% background). Changes in performance are assumed to linearly impact the power consumption in the rest of the system, both background and dynamic. Having taken these into account, on average, we see 18% and 36% reductions in system power with SBA and SSA respectively.

5. RELATED WORK

The significant contribution of DRAM to overall system power consumption has been documented in several studies [10, 32, 37]. A majority of techniques aimed at conserving DRAM energy try to transition inactive DRAM chips to low-power states [30] as effectively as possible to decrease the background power. Researchers have investigated prediction models for DRAM activity [19], adaptive memory controller policies [23], compiler-directed hardware-assisted data layout [16], management of DMA and CPU generated request streams to increase DRAM idle periods [22, 43], as well as managing the virtual memory footprint and physical memory allocation schemes [14, 17, 21] to transition idle DRAM devices to low-power modes.

The theme of the other major volume of work aimed at DRAM power reduction has involved rank-subsetting. In addition to exploiting low-power states, these techniques attempt to reduce the dynamic energy component of an access. Zheng et al. suggest the subdivision of a conventional DRAM rank into mini-ranks [56] comprising a subset of DRAM devices. Ahn et al. [7, 8] propose a scheme where each DRAM device can be controlled individually via a demux register per channel that is responsible for routing all command signals to the appropriate chip. In their multi-core DIMM proposal, multiple DRAM devices on a DIMM can be combined to form a Virtual Memory Device (VMD) and a cache line is supplied by one such VMD. They further extend their work with a comprehensive analytical model to estimate the implications of rank-subsetting on performance and power. They also identify the need for mechanisms that would ensure chipkill-level reliability and extend their designs with SCCDCD mechanisms. A similar approach was proposed by Ware et al., employing high-speed signals to send chip selects separately to parts of a DIMM in order to achieve dual/quad-threaded DIMMs [53]. On the other hand, Sudan et al. [47] attempt to improve row-buffer utilization by packing heavily used cache lines into "Micro-Pages".

Other DRAM-related work includes design for 3D architectures (Loh [36]), and design for systems with photonic interconnects (Vantrease et al. [51] and Beamer et al. [12]). Yoon and Erez [55] outline efficient chipkill-level reliability mechanisms for DRAM systems but work with existing microarchitectures and data layouts.

However, to the best of our knowledge, our work is the first to attempt fundamental microarchitectural changes to the DRAM system specifically targeting reduced energy consumption. Our SBA mechanism with Posted-RAS is a novel way to reduce activation and can eliminate overfetch. The SSA mechanism re-organizes the layout of a DRAM chip to support small subarrays and the mapping of data to only activate a single subarray. Our chipkill solution that uses checksum-based detection and RAID-like correction has not been previously considered and is more effective than those used for prior DRAM chipkill solutions [8, 18, 35].

6. CONCLUSIONS

We propose two novel techniques to eliminate overfetch in DRAM systems by activating only the necessary bitlines (SBA) and then going as far as to isolate an entire cache line to a single small subarray on a single DRAM chip (SSA). Our solutions will require non-trivial initial design effort on the part of DRAM vendors and will incur minor area/cost increases. A similar architecture will likely also be suitable for emerging memory technologies such as PCM and STT-RAM. The memory energy reductions from our techniques are substantial for both dynamic (6X) and background (5X) components. We observe that fetching exactly a cache line with SSA can improve performance in some cases (over 50% on average) due to its close-page policy and also because it helps alleviate bank contention in some memory-sensitive applications. In other applications that are not as constrained by bank contention, the SSA policy can cause performance degradations (13% on average) because of long cache line data transfer times out of a single DRAM chip.

Any approach that reduces the number of chips used to store a cache line also increases the probability of correlated errors. With SSA, we read an entire cache line out of a single DRAM array, so the potential for correlated errors is increased. In order to provide chipkill-level reliability in concert with SSA, we introduced checksums stored for each cache line in the DRAM, similar to those provided in hard drives. Using the checksum we can provide robust error detection capabilities, and provide chipkill-level reliability through RAID techniques (however, in our case, we use a Redundant Array of Inexpensive DRAMs). We show that this approach is more effective in terms of area and energy than prior chipkill approaches, and only incurs a 12% performance penalty compared to an SSA memory system without chipkill.

7. ACKNOWLEDGMENTS

This work was supported in part by NSF grants CCF-0430063, CCF-0811249, CCF-0916436, NSF CAREER award CCF-0545959, SRC grant 1847.001, and the University of Utah. The authors would also like to thank Utah Arch group members Kshitij Sudan, Manu Awasthi, and David Nellans for help with the baseline DRAM simulator.
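As a closing illustration of the idle-threshold policy evaluated in Section 4.2.1 (a bank or subarray drops into a low-power state after I idle cycles and pays a W-cycle wake-up on the next access), the sketch below models that trade-off. It is a minimal model with illustrative names and inputs, not the simulator's code.

```python
# Minimal model of the idle-cycle-threshold power-down policy: after
# `idle_threshold` idle cycles a subarray enters the low-power state, and the
# next request pays `wakeup` extra cycles. Returns the fraction of time spent
# in the low-power state and the average added latency per access.

def lowpower_stats(arrivals, idle_threshold, wakeup):
    sleep_cycles = 0
    added_latency = 0
    for prev, cur in zip(arrivals, arrivals[1:]):
        gap = cur - prev
        if gap > idle_threshold:          # idle long enough to power down
            sleep_cycles += gap - idle_threshold
            added_latency += wakeup       # next access must wake the array
    span = arrivals[-1] - arrivals[0]
    return sleep_cycles / span, added_latency / (len(arrivals) - 1)
```

Raising the idle threshold (being less aggressive) shrinks both the sleep fraction and the latency penalty, which is precisely the trade-off Figures 9 and 10 quantify for the Power Down and Self Refresh modes.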
8. REFERENCES

[1] CACTI: An Integrated Cache and Memory Access Time, Cycle Time, Area, Leakage, and Dynamic Power Model. http://www.hpl.hp.com/research/cacti/.
[2] HP Advanced Memory Protection Technologies - Technology Brief. http://www.hp.com.
[3] Micron System Power Calculator. http://www.micron.com/support/partinfo/powercalc.
[4] STREAM - Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.
[5] Virtutech Simics Full System Simulator. http://www.virtutech.com.
[6] M. Abbott et al. Durable Memory RS/6000 System Design. In Proceedings of the International Symposium on Fault-Tolerant Computing, 1994.
[7] J. Ahn, J. Leverich, R. S. Schreiber, and N. Jouppi. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs. IEEE Computer Architecture Letters, 7(1), 2008.
[8] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber. Future Scaling of Processor-Memory Interfaces. In Proceedings of SC, 2009.
[9] D. Bailey et al. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5(3):63-73, Fall 1991.
[10] L. Barroso. The Price of Performance. Queue, 3(7):48-53, 2005.
[11] L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, 2009.
[12] S. Beamer et al. Re-Architecting DRAM Memory Systems with Monolithically Integrated Silicon Photonics. In Proceedings of ISCA, 2010.
[13] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Department of Computer Science, Princeton University, 2008.
[14] P. Burns et al. Dynamic Tracking of Page Miss Ratio Curve for Memory Management. In Proceedings of ASPLOS, 2004.
[15] V. Cuppu and B. Jacob. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance? In Proceedings of ISCA, 2001.
[16] V. Delaluz et al. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In Proceedings of HPCA, 2001.
[17] V. Delaluz et al. Scheduler-based DRAM Energy Management. In Proceedings of DAC, 2002.
[18] T. J. Dell. A Whitepaper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. Technical report, IBM Microelectronics Division, 1997.
[19] X. Fan, H. Zeng, and C. Ellis. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, 2001.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 4th edition, 2007.
[21] H. Huang, P. Pillai, and K. G. Shin. Design and Implementation of Power-Aware Virtual Memory. In Proceedings of the USENIX Annual Technical Conference, 2003.
[22] H. Huang, K. Shin, C. Lefurgy, and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In Proceedings of ISLPED, 2005.
[23] I. Hur and C. Lin. A Comprehensive Approach to DRAM Power Management. In Proceedings of HPCA, 2008.
[24] E. Ipek, O. Mutlu, J. Martinez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In Proceedings of ISCA, 2008.
[25] K. Itoh. VLSI Memory Chip Design. Springer, 2001.
[26] ITRS. International Technology Roadmap for Semiconductors, 2007 Edition. http://www.itrs.net/Links/2007ITRS/Home2007.htm.
[27] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems - Cache, DRAM, Disk. Elsevier, 2008.
[28] M. Kumanoya et al. An Optimized Design for High-Performance Megabit DRAMs. Electronics and Communications in Japan, 72(8), 2007.
[29] O. La. SDRAM having posted CAS function of JEDEC standard, 2002. United States Patent, Number 6483769.
[30] A. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power Aware Page Allocation. In Proceedings of ASPLOS, 2000.
[31] C. Lee, O. Mutlu, V. Narasiman, and Y. Patt. Prefetch-Aware DRAM Controllers. In Proceedings of MICRO, 2008.
[32] C. Lefurgy et al. Energy Management for Commercial Servers. IEEE Computer, 36(2):39-48, 2003.
[33] K. Lim et al. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In Proceedings of ISCA, 2008.
[34] K. Lim et al. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of ISCA, 2009.
[35] D. Locklear. Chipkill Correct Memory Architecture. Technical report, Dell, 2000.
[36] G. Loh. 3D-Stacked Memory Architectures for Multi-Core Processors. In Proceedings of ISCA, 2008.
[37] D. Meisner, B. Gold, and T. Wenisch. PowerNap: Eliminating Server Idle Power. In Proceedings of ASPLOS, 2009.
[38] Micron Technology Inc. Micron DDR2 SDRAM Part MT47H256M8, 2006.
[39] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of MICRO, 2007.
[40] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007.
[41] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008.
[42] U. Nawathe et al. An 8-Core 64-Thread 64b Power-Efficient SPARC SoC. In Proceedings of ISSCC, 2007.
[43] V. Pandey, W. Jiang, Y. Zhou, and R. Bianchini. DMA-Aware Memory Energy Management. In Proceedings of HPCA, 2006.
[44] B. Rogers et al. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling. In Proceedings of ISCA, 2009.
[45] V. Romanchenko. Quad-Core Opteron: Architecture and Roadmaps. http://www.digital-daily.com/cpu/quad core opteron.
[46] B. Schroeder, E. Pinheiro, and W. Weber. DRAM Errors in the Wild: A Large-Scale Field Study. In Proceedings of SIGMETRICS, 2009.
[47] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In Proceedings of ASPLOS-XV, 2010.
[48] R. Swinburne. Intel Core i7 - Nehalem Architecture Dive. http://www.bit-tech.net/hardware/2008/11/03/intel-core-i7-nehalem-architecture-dive/.
[49] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories, 2007.
[50] U.S. Environmental Protection Agency - Energy Star Program. Report to Congress on Server and Data Center Energy Efficiency - Public Law 109-431, 2007.
[51] D. Vantrease et al. Corona: System Implications of Emerging Nanophotonic Technology. In Proceedings of ISCA, 2008.
[52] D. Wang et al. DRAMsim: A Memory-System Simulator. In SIGARCH Computer Architecture News, volume 33, September 2005.
[53] F. A. Ware and C. Hampel. Improving Power and Data Efficiency with Threaded Memory Modules. In Proceedings of ICCD, 2006.
[54] D. Wentzlaff et al. On-Chip Interconnection Architecture of the Tile Processor. In IEEE Micro, volume 22, 2007.
[55] D. Yoon and M. Erez. Virtualized and Flexible ECC for Main Memory. In Proceedings of ASPLOS, 2010.
[56] H. Zheng et al. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. In Proceedings of MICRO, 2008.