
Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores

Aniruddha N. Udipi, University of Utah, Salt Lake City, UT, udipi@cs.utah.edu
Naveen Muralimanohar, Hewlett-Packard Laboratories, Palo Alto, CA, naveen.murali@hp.com
Niladrish Chatterjee, University of Utah, Salt Lake City, UT, nil@cs.utah.edu
Rajeev Balasubramonian, University of Utah, Salt Lake City, UT, rajeev@cs.utah.edu
Al Davis, University of Utah, Salt Lake City, UT, ald@cs.utah.edu
Norman P. Jouppi, Hewlett-Packard Laboratories, Palo Alto, CA, norm.jouppi@hp.com

ABSTRACT

DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bitlines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day servers where operating costs can easily exceed the purchase cost. Modern technology trends are also placing very different demands on the memory system: (i) queuing delays are a significant component of memory access time, (ii) there is a high energy premium for the level of reliability expected for business-critical computing, and (iii) the memory access stream emerging from multi-core systems exhibits limited locality. All of these trends necessitate an overhaul of DRAM architecture, even if it means a slight compromise in the cost-per-bit metric.

This paper examines three primary innovations. The first is a modification to DRAM chip microarchitecture that retains the traditional DDRx SDRAM interface. Selective Bitline Activation (SBA) waits for both RAS (row address) and CAS (column address) signals to arrive before activating exactly those bitlines that provide the requested cache line. SBA reduces energy consumption while incurring slight area and performance penalties. The second innovation, Single Subarray Access (SSA), fundamentally re-organizes the layout of DRAM arrays and the mapping of data to these arrays so that an entire cache line is fetched from a single subarray. It requires a different interface to the memory controller, reduces dynamic and background energy (by about 6X and 5X), incurs a slight area penalty (4%), and can even lead to performance improvements (54% on average) by reducing queuing delays. The third innovation further penalizes the cost-per-bit metric by adding a checksum feature to each cache line. This checksum error-detection feature can then be used to build stronger RAID-like fault tolerance, including chipkill-level reliability. Such a technique is especially crucial for the SSA architecture where the entire cache line is localized to a single chip. This DRAM chip microarchitectural change leads to a dramatic reduction in the energy and storage overheads for reliability. The proposed architectures will also apply to other emerging memory technologies (such as resistive memories) and will be less disruptive to standards, interfaces, and the design flow if they can be incorporated into first-generation designs.

ISCA'10, June 19-23, 2010, Saint-Malo, France. Copyright 2010 ACM 978-1-4503-0053-7/10/06.
Categories and Subject Descriptors
B.3.1 [Memory Structures]: Semiconductor Memories - Dynamic memory (DRAM); B.3.2 [Memory Structures]: Design Styles - Primary memory; B.8.1 [Performance and Reliability]: Reliability, Testing and Fault-Tolerance; C.5.5 [Computer System Implementation]: Servers

General Terms
Design, Performance, Reliability

Keywords
DRAM Architecture, Energy-efficiency, Locality, Chipkill, Subarrays

1. INTRODUCTION

The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the usage of mobile devices which access the web-based information infrastructure. It is expected that most CPU-intensive computing may either happen in servers housed in large datacenters, e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. In both situations, it is expected that the memory system will be problematic in terms of performance, reliability, and power consumption.

The memory wall is not new: long DRAM memory latencies have always been a problem. Given that little can be done about the latency problem, DRAM vendors have chosen to optimize their designs for improved bandwidth, increased density, and minimum cost-per-bit. With these objectives in mind, a few DRAM architectures, standards, and interfaces were instituted in the 1990s and have persisted since then. However, the objectives in datacenter servers and HPC platforms of the future will be very different than those that are reasonable for personal computers, such as desktop machines. As a result, traditional DRAM architectures are highly inefficient from a future system perspective, and are in need of a major revamp. Consider the following technological trends that place very different demands on future DRAM architectures:

Energy: While energy was never a first-order design constraint in prior DRAM systems, it has certainly emerged as the primary constraint today, especially in datacenters. Energy efficiency in datacenters has already been highlighted as a national priority [50]. Many studies attribute 25-40% of total datacenter power to the DRAM system [11, 33, 34, 37]. Modern DRAM architectures are ill-suited for energy-efficient operation because they are designed to fetch much more data than required. This overfetch wastes dynamic energy. Today's DRAMs employ coarse-grained power-down tactics to reduce area and cost, but finer-grained approaches can further reduce background energy.

Reduced locality: Single-core workloads typically exhibit high locality. Consequently, current DRAMs fetch many kilobytes of data on every access and keep them in open row buffers so that subsequent requests to neighboring data elements can be serviced quickly. The high degree of multi-threading in future multi-cores [42] implies that memory requests from multiple access streams get multiplexed at the memory controller, thus destroying a large fraction of the available locality. The severity of this problem will increase with increased core and memory controller counts that are expected for future microprocessor chips. This trend is exacerbated by the increased use of aggregated memory pools ("memory blades" that are comprised of many commodity DIMMs) that serve several CPU sockets in an effort to increase resource utilization [34]. This mandates that future DRAM architectures place a lower priority on locality and a higher priority on parallelism.
Queuing Delays: For several years, queuing delays at the memory controller were relatively small because a single core typically had relatively few pending memory operations and DRAM systems were able to steeply increase peak memory bandwidth every year [20]. In the future, the number of pins per chip is expected to grow very slowly. The 2007 ITRS Roadmap [26] expects a 1.47x increase in the number of pins over an 8-year time-frame; over the same period, Moore's Law dictates at least a 16x increase in the number of cores. This implies that requests from many cores will be competing to utilize the limited pin bandwidth. Several studies have already highlighted the emergence of queuing delay as a major bottleneck [24, 31, 40, 41, 44, 56]. A DRAM architecture that is geared towards higher parallelism will likely be able to de-queue requests faster and better utilize the available limited data bandwidth.

Efficient Reliability: Recent studies have highlighted the need for DRAM architectures that are resilient to single faults or even failure within an entire DRAM chip [8, 46], especially in datacenter platforms. Because these fault-tolerant solutions are built upon commodity DRAM chips, they incur very high overheads in terms of energy and cost. New DRAM architectures can provide much more efficient reliability if fault-tolerant features are integrated into the DRAM chip microarchitecture at design time.

Lower relevance of DRAM chip area: DRAM vendors have long optimized the cost-per-bit metric. However, given that datacenters consume several billion kilowatt hours of energy every year [50], it has been shown that the 3-year operating energy costs of today's datacenters equal the capital acquisition costs [33]. Therefore, it may now be acceptable to incur a slightly higher cost-per-bit when purchasing DRAM as long as it leads to significantly lower energy footprints during operation.

The design of DRAM devices specifically addressing these trends has, to the best of our knowledge, not been previously studied and is now more compelling than ever. We attempt to fundamentally rethink DRAM microarchitecture and organization to achieve highly reliable, high performance operation with extremely low energy footprints, all within acceptable area bounds. In this work, we propose two independent designs, both attempting to activate the minimum circuitry required to read a single cache line. We make the following three significant contributions:

- We introduce and evaluate Posted RAS in combination with a Selective Bitline Activation (SBA) scheme. This entails a relatively simple change to DRAM microarchitecture, with only a minor change to the DRAM interface, to provide significant dynamic energy savings.
- We propose and evaluate a reorganization of DRAM chips and their interface, so that cache lines can be read via a Single Subarray Access (SSA) in a single DRAM chip. This approach trades off higher data transfer times for greater (dynamic and background) energy savings.
- In order to provide chipkill-level reliability [18, 35] even though we are reading a cache line out of a single DRAM device, we propose adding a checksum to each cache line in the SSA DRAM to provide error detection. We then evaluate the use of RAID techniques to reconstruct cache lines in the event of a chip failure.

While this study focuses on DRAM as an evaluation vehicle, the proposed architectures will likely apply just as well to other emerging storage technologies, such as phase change memory (PCM) and spin torque transfer RAM (STT-RAM).

2. BACKGROUND AND MOTIVATION

2.1 DRAM Basics and Baseline Organization

We first describe the typical modern DRAM architecture [27]. For most of the paper, our discussion will focus on the dominant DRAM architecture today: JEDEC-style DDRx SDRAM; an example is shown in Figure 1. Modern processors [45, 48, 54] often integrate memory controllers on the processor die. Each memory controller is connected to one or two dedicated off-chip memory channels.
For JEDEC standard DRAM, the channel typically has a 64-bit data bus, a 17-bit row/column address bus, and an 8-bit command bus [38]. Multiple dual in-line memory modules (DIMMs) can be accessed via a single memory channel and memory controller. Each DIMM typically comprises multiple ranks, each rank consisting of a set of DRAM chips.

Figure 1: An example DDRx SDRAM architecture with 1 DIMM, 2 ranks, and 8 x4 DRAM chips per rank (memory controller, memory bus or channel, DIMM, ranks, banks, DRAM chips or devices, and arrays; one word of data output comes from 1/8th of the row buffer).

We will call this a rank-set. Exactly one rank-set is activated on every memory operation and this is the smallest number of chips that need to be activated to complete a read or write operation. Delays on the order of a few cycles are introduced when the memory controller switches between ranks to support electrical bus termination requirements. The proposed DRAM architecture is entirely focused on the DRAM chips, and has neither a positive nor negative effect on rank issues. Figure 1 shows an example DIMM with 16 total DRAM chips forming two rank-sets.

Each DRAM chip has an intrinsic word size which corresponds to the number of data I/O pins on the chip. An xN DRAM chip has a word size of N, where N refers to the number of bits going in/out of the chip on each clock tick. For a 64-bit data bus and x8 chips, a rank-set would require 8 DRAM chips (Figure 1 only shows 8 x4 chips per rank-set to simplify the figure). If the DIMM supports ECC, the data bus expands to 72 bits and the rank-set would consist of 9 x8 DRAM chips. When a rank is selected, all DRAM chips in the rank-set receive address and command signals from the memory controller on the corresponding shared buses. Each DRAM chip is connected to a subset of the data bus; of the 64-bit data packet being communicated on the bus on a clock edge, each x8 chip reads/writes an 8-bit subset.

A rank is itself partitioned into multiple banks, typically 4-16. Each bank can be concurrently processing a different memory request, thus affording a limited amount of memory parallelism. Each bank is distributed across the DRAM chips in a rank; the portion of a bank in each chip will be referred to as a sub-bank. The organization of a sub-bank will be described in the next paragraph. When the memory controller issues a request for a cache line, all the DRAM chips in the rank are activated and each sub-bank contributes a portion of the requested cache line. By striping a cache line across multiple DRAM chips, the available pin and channel bandwidth for the cache line transfer can be enhanced. If the data bus width is 64 bits and a cache line is 64 bytes, the cache line transfer happens in a burst of 8 data transfers.

If a chip is an xN part, each sub-bank is itself partitioned into N arrays (see Figure 1). Each array contributes a single bit to the N-bit transfer on the data I/O pins for that chip on a clock edge. An array has several rows and columns of single-bit DRAM cells. A cache line request starts with a RAS command that carries the subset of address bits that identify the bank and the row within that bank. Each array within that bank now reads out an entire row. The bits read out are saved in latches, referred to as the row buffer. The row is now considered opened. The page size or row buffer size is defined as the number of bits read out of all arrays involved in a bank access (usually 4-16 KB). Of these, only a cache line worth of data (identified by the CAS command and its associated subset of address bits) is communicated on the memory channel for each CPU request. Each bank has its own row buffer, so there can potentially be 4-16 open rows at any time. The banks can be accessed in parallel, but the data transfers have to be serialized over the shared data bus. If the requested data is present in an open row (a row buffer hit), the memory controller is aware of this, and data can be returned much faster. If the requested data is not present in the bank's row buffer (a row buffer miss), the currently open row (if one exists) has to first be closed before opening the new row. To prevent the closing of the row from being on the critical path for the next row buffer miss, the controller may adopt a close-page policy that closes the row right after returning the requested cache line. Alternatively, an open-page policy keeps a row open until the bank receives a request for a different row.
As an example system, consider a 4 GB system, with two 2 GB ranks, each consisting of eight 256 MB x8, 4-bank devices, serving an L2 with a 64-byte cache line size. On every request from the L2 cache, each device has to provide 8 bytes of data. Each of the 4 banks in a 256 MB device is split into 8 arrays of 8 MB each. If there are 65,536 rows of 1,024 columns of bits in each array, a row access brings down 1,024 bits per array into the row buffer, giving a total row buffer size of 65,536 bits across 8 chips of 8 arrays each. The page size is therefore 65,536 bits (8 KBytes) and of these, only 64 bytes are finally returned to the processor, with each of the eight chips being responsible for 64 bits of the cache line. Such a baseline system usually significantly under-utilizes the bits it reads out (in the above example, only about 0.8% of the row buffer bits are utilized for a single cache line access) and ends up unnecessarily activating various circuits across the rank-set.

2.2 Motivational Data

Recent studies have indicated the high energy needs of datacenters [50] and that memory contributes up to 40% of total server power consumption [11, 33, 37]. We start with a workload characterization on our simulation infrastructure (methodology details in Section 4.1). Figure 2 shows the trend of steeply dropping row-buffer hit rates as the number of threads simultaneously accessing memory goes up. We see average rates drop from over 60% for a 1-core system to 35% for a 16-core system. We also see that whenever a row is fetched into the row-buffer, the number of times it is used before being closed due to a conflict is often just one or two (Figure 3). This indicates that even on benchmarks with high locality and good average row buffer hit rates (for example, cg), a large number of pages still don't have much reuse in the row-buffer. These trends have also been observed in prior work on Micro-Pages [47]. This means that the energy cost of activating an entire 8 KB row is amortized over very few accesses, wasting significant energy.
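To make the overfetch arithmetic in the baseline example above concrete, the short Python sketch below recomputes the row-buffer utilization for a single cache-line access. The device geometry (x8 chips, 8 arrays per sub-bank, 1,024 columns per array) is taken directly from the example; the helper names are purely illustrative.

```python
# Row-buffer utilization for the example baseline system in Section 2.1.
# Geometry comes from the paper's example; names are illustrative only.

CHIPS_PER_RANK_SET = 8      # x8 devices filling a 64-bit data bus
ARRAYS_PER_SUB_BANK = 8     # one array per data I/O pin on an x8 part
COLUMNS_PER_ARRAY = 1024    # bits brought into the row buffer per array
CACHE_LINE_BYTES = 64

def row_buffer_utilization():
    # Every array in every chip of the rank-set latches a full row segment.
    row_buffer_bits = CHIPS_PER_RANK_SET * ARRAYS_PER_SUB_BANK * COLUMNS_PER_ARRAY
    useful_bits = CACHE_LINE_BYTES * 8
    return row_buffer_bits, useful_bits / row_buffer_bits

bits, util = row_buffer_utilization()
print(f"row buffer = {bits} bits ({bits // 8 // 1024} KB), "
      f"utilization for one 64B line = {util:.2%}")
# -> row buffer = 65536 bits (8 KB), utilization for one 64B line = 0.78%
```

Unless the row is reused several times before being closed, which Figures 2 and 3 show is increasingly rare, the other roughly 99% of the activated bits are wasted work.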
Figure 2: Row buffer hit rate trend (row-buffer hit rate, %, for 1-, 2-, 4-, 8-, and 16-core systems).

Figure 3: Row use count for 8 cores (percentage of row fetches by use count, including use counts greater than 3, with row-buffer hit rate overlaid).

3. PROPOSED ARCHITECTURE

We start with the premise that the traditional row-buffer locality assumption is no longer valid, and try to find an energy-optimal DRAM design with minimal impacts on area and latency. Our first novel design (Selective Bitline Activation, Section 3.1) requires minor changes to DRAM chip microarchitecture, but is compatible with existing DRAM standards and interfaces. The second novel design (Single Subarray Access, Section 3.2) requires non-trivial changes to DRAM chip microarchitecture and its interface to the memory controller. Section 3.3 describes our novel chipkill solution for the proposed architecture.

3.1 Selective Bitline Activation (SBA)

In an effort to mitigate the overfetch problem with minimal disruption to existing designs and standards, we propose the following two simple modifications: (i) we activate a much smaller segment of the wordline and (ii) we activate only those bitlines corresponding to the requested cache line. Note that we will still need a wire spanning the array to identify the exact segment of wordline that needs to be activated, but this is very lightly loaded and therefore has low delay and energy. Thus, we are not changing the way data gets laid out across DRAM chip arrays, but every access only brings down the relevant cache line into the row buffer. As a result, the notion of an open-page policy is now meaningless. After every access, the cache line is immediately written back. Most of the performance difference from this innovation is because of the shift to a close-page policy: for workloads with little locality, this can actually result in performance improvements as the page precharge after write-back is taken off the critical path of the subsequent row buffer miss. Next, we discuss the microarchitectural modifications in more detail.
Figure 4: Hierarchical wordline with region select (MWL: Main Wordline, SWL: Sub-Wordline, RX: Region Select). Figure courtesy "VLSI Memory Chip Design", K. Itoh.

Memory systems have traditionally multiplexed RAS and CAS commands on the same I/O lines due to pin count limitations. This situation is unlikely to change due to technological limitations [26] and is a hard constraint for DRAM optimization. In a traditional design, once the RAS arrives, enough information is available to activate the appropriate wordline within the array. The cells in that row place their data on the corresponding bitlines. Once the row's data is latched into the row buffer, the CAS signal is used to return some fraction of the many bits read from that array. In our proposed design, instead of letting the RAS immediately activate the entire row and all the bitlines, we wait until the CAS has arrived to begin the array access. The CAS bits identify the subset of the row that needs to be activated and the wordline is only driven in that section. Correspondingly, only those bitlines place data in the row buffer, saving the activation energy of the remaining bits. Therefore, we need the RAS and the CAS before starting the array access. Since the RAS arrives early, it must be stored in a register until the CAS arrives. We refer to this process as Posted-RAS (see footnote 1). Because we are now waiting for the CAS to begin the array access, some additional cycles (on the order of 10 CPU cycles) are added to the DRAM latency. We expect this impact (quantified in Section 4) to be relatively minor because of the hundreds of cycles already incurred on every DRAM access. Note again that this change is compatible with existing JEDEC standards: the memory controller issues the same set of commands; we simply save the RAS in a register until the CAS arrives before beginning the array access.

The selective bitline activation is made possible by only activating a small segment of the wordline. We employ hierarchical wordlines to facilitate this, at some area cost. Each wordline consists of a Main Wordline (MWL), typically run in first-level metal, controlling Sub-Wordlines (SWL), typically run in poly, which actually connect to the memory cells (see Figure 4). The MWL is loaded only by a few "AND" gates that enable the sub-wordlines, significantly reducing its capacitance, and therefore its delay. "Region Select (RX)" signals control activation of specific SWLs.

Footnote 1: Many memory controllers introduce a gap between the issue of the RAS and CAS so that the CAS arrives just as the row buffer is being populated and the device's Trcd constraint is satisfied [27]. Some memory systems send the CAS immediately after the RAS. The CAS is then saved in a register at the DRAM chip until the row buffer is ready. This is referred to as Posted-CAS [29]. We refer to our scheme as Posted-RAS because the RAS is saved in a register until the arrival of the CAS.
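The following Python sketch illustrates the Posted-RAS control flow described above: the RAS is latched in a register, and the array access is deferred until the CAS supplies the column bits that drive the region-select (RX) signals. This is a behavioral sketch only; the class, method, and parameter names are ours, not part of any DRAM standard.

```python
# Behavioral sketch of SBA with Posted-RAS (illustrative names, not a device model).

class SBABank:
    def __init__(self, rows, regions_per_row):
        self.rows = rows
        self.regions_per_row = regions_per_row
        self.posted_ras = None   # RAS held in a register until the CAS arrives

    def ras(self, row_addr):
        # A conventional DRAM would activate the full row (all bitlines) here.
        # Under Posted-RAS we only latch the row address and wait for the CAS.
        self.posted_ras = row_addr

    def cas(self, col_addr):
        assert self.posted_ras is not None, "CAS without a preceding RAS"
        row = self.posted_ras
        # A subset of the CAS bits drives the region-select (RX) signal, so only
        # one sub-wordline segment and its bitlines are activated.
        rx_region = col_addr % self.regions_per_row
        self.posted_ras = None   # close-page: the line is written back immediately
        return (row, rx_region)  # stands in for the returned cache line

bank = SBABank(rows=65536, regions_per_row=16)
bank.ras(row_addr=1234)
print(bank.cas(col_addr=5))      # -> (1234, 5): one region activated, not the full row
```

The latency cost is the extra wait for the CAS (on the order of 10 CPU cycles, per the text), paid in exchange for activating only a small fraction of the bitlines.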
Hierarchical wordlines have been previously proposed for DRAMs [25] to reduce delay (rather than energy). Until now, other techniques (metal shunted wordlines [28], for instance) partially achieved what has been perceived as the advantage of hierarchical wordlines: significant reductions in wordline delay. In a shunted wordline, a metal wordline is stitched to the low-pitch poly wordline at regular intervals by metal-poly contacts. This reduces the wordline delay by limiting the high-resistance poly to a small distance while saving area by having only a few metal-poly contacts. The increased area costs of hierarchical wordlines have therefore not been justifiable thus far. Now, with the increasing importance of energy considerations, we believe that using hierarchical wordlines is not only acceptable, but actually necessary. Note that wordlines do not contribute as much to overall DRAM energy, so this feature is important not for its wordline energy savings, but because it enables selective bitline activation. In our proposed design, a subset of the CAS address is used to trigger the RX signal, reducing the activation area and wordline/bitline energy. Note that since the MWL is not directly connected to the memory cells, the activation of the MWL across the array does not result in destruction of data, since only the small subset of cells connected to the active SWL read their data out.

We incorporated an analytical model for hierarchical wordlines into CACTI 6.5 [39, 49] (more details in Section 4.1) to quantify the area overhead. For the specific DRAM part described in Section 4.1, we observed that an area overhead of 100% was incurred when enough SWLs were introduced to activate exactly one cache line in a bank. This is because of the high area overhead introduced by the AND gate and RX signals for a few memory cells. While this results in activating a minimum number of bitlines, the cost may be prohibitive. However, we can trade off energy for lower cost by not being as selective. If we were to instead read out 16 cache lines, the SWLs become 16 times longer. This still leads to high energy savings over the baseline, and a more acceptable area overhead of 12%. Most of our results in Section 4 pertain to this model. Even though we are reading out 16 cache lines, we continue to use the close-page policy.

In summary, the SBA mechanism (i) reduces bitline and wordline dynamic energy by reading out a limited number of cache lines from the arrays (to significantly reduce overfetch), (ii) impacts performance (negatively or positively) by using a close-page policy, (iii) negatively impacts performance by waiting for CAS before starting array access, (iv) increases area and cost by requiring hierarchical wordlines, and finally (v) does not impact the DRAM interface. As we will discuss subsequently, this mechanism does not impact any chipkill solutions for the DRAM system because the data organization across the chips has not been changed.

3.2 Single Subarray Access (SSA)

While the SBA design can eliminate overfetch, it is still an attempt to shoehorn in energy optimizations in a manner that conforms to modern-day DRAM interfaces and data layouts. Given that we have reached an inflection point, a major rethink of DRAM design is called for. An energy-efficient architecture will also be relevant for other emerging storage technologies. This sub-section defines an energy-optimized architecture (SSA) that is not encumbered by existing standards.

Many features in current DRAMs have contributed to better locality handling and low cost-per-bit, but also to high energy overhead. Arrays are designed to be large structures so that the peripheral circuitry is better amortized. While DRAMs can allow low-power sleep modes for arrays, the large size of each array implies that the power-down granularity is rather coarse, offering fewer power-saving opportunities. Since each DRAM chip has limited pin bandwidth, a cache line is striped across all the DRAM chips on a DIMM to reduce the data transfer time (and also to improve reliability). As a result, a single access activates multiple chips, and multiple large arrays within each chip.

Figure 5: SSA DRAM Architecture (memory controller, address/command and data buses, a DIMM of x8 DRAM chips; within each chip, 64-byte-wide subarrays are grouped into banks that share a row buffer, an on-die interconnect, and the I/O).
Overview: To overcome the above drawbacks and minimize energy, we move to an extreme model where an entire cache line is read out of a single small array in a single DRAM chip. This small array is henceforth referred to as a "subarray". Figure 5 shows the entire memory channel and how various sub-components are organized. The subarray is as wide as a cache line. Similar to SBA, we see a dramatic reduction in dynamic energy by only activating enough bitlines to read out a single cache line. Further, remaining inactive subarrays can be placed in low-power sleep modes, saving background energy. The area overhead of SSA is lower than that of SBA since we divide the DRAM array at a much coarser granularity. If the DRAM chip is an x8 part, we either need to provide 8 wires from each subarray to the I/O pins or provide a single wire and serialize the transfer. We adopt the former option and, as shown in Figure 5, the subarrays place their data on a shared 8-bit bus. In addition, since the entire cache line is being returned via the limited pins on a single chip, it takes many more cycles to effect the data transfer to the CPU. Thus, the new design clearly incurs a higher DRAM latency because of slow data transfer rates. It also only supports a close-page policy, which can impact performance either positively or negatively. On the other hand, the design has much higher concurrency, as each DRAM chip can be simultaneously servicing a different cache line. Since each chip can implement several independent subarrays, there can also be much higher intra-chip or bank-level concurrency. We next examine our new design in greater detail.

Memory Controller Interface: Just as in the baseline, a single address/command bus is used to communicate with all DRAM chips on the DIMM. The address is provided in two transfers because of pin limitations on each DRAM chip. This is similar to RAS and CAS in a conventional DRAM, except that they need not be called as such (there isn't a column-select in our design). The address bits from both transfers identify a unique subarray and row (cache line) within that subarray. Part of the address now identifies the DRAM chip that has the cache line (not required in conventional DRAM because all chips are activated). The entire address is required before the subarray can be identified or accessed. Similar to the SBA technique, a few more cycles are added to the DRAM access latency. An additional requirement is that every device has to be capable of latching commands as they are received, to enable the command bus to then move on to operating a different device. This can easily be achieved by having a set of registers (each capable of signaling one device) connected to a demultiplexer which reads commands off the command bus and redirects them appropriately. The data bus is physically no different than the conventional design: for an xN DRAM chip, N data bits are communicated between the DRAM chip and the memory controller every bus cycle. Logically, the N bits from every DRAM chip on a DIMM rank were part of the same cache line in the conventional design; now they are completely independent and deal with different cache lines. Therefore, it is almost as if there are eight independent narrow channels to this DIMM, with the caveat that they all share a single address/command bus.
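As an illustration of the interface just described, the sketch below decomposes a cache-line index into the fields an SSA memory controller would need: chip, bank, subarray, and row (cache line) within the subarray, with consecutive lines interleaved first across chips and then across banks of the same chip, as the Subarray Organization paragraph below describes. The field widths and helper names are our own assumptions for a hypothetical 8-chip, 8-bank-per-chip DIMM, not values fixed by the paper.

```python
# Illustrative SSA address decomposition for a hypothetical DIMM with
# 8 x8 chips and 8 banks per chip; the field widths are assumptions, not spec.

from collections import namedtuple

SSAAddress = namedtuple("SSAAddress", "chip bank subarray row")

CHIPS = 8                 # one cache line lives entirely in one chip
BANKS_PER_CHIP = 8        # each physical bank is independent in SSA
SUBARRAYS_PER_BANK = 128  # assumed count, for illustration only
ROWS_PER_SUBARRAY = 1024  # 64-byte-wide rows, i.e. one cache line per row

def decode(cache_line_index):
    # Consecutive cache lines go to different chips first, then to different
    # banks of the same chip, to spread load and maximize concurrency.
    chip = cache_line_index % CHIPS
    rest = cache_line_index // CHIPS
    bank = rest % BANKS_PER_CHIP
    rest //= BANKS_PER_CHIP
    subarray = rest % SUBARRAYS_PER_BANK
    row = (rest // SUBARRAYS_PER_BANK) % ROWS_PER_SUBARRAY
    return SSAAddress(chip, bank, subarray, row)

for line in range(4):
    print(line, decode(line))   # lines 0..3 land on chips 0..3
```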
Subarray Organization: The height of each subarray (i.e., the number of cache lines in a given subarray) directly determines the delay/energy per access within the subarray. Many small subarrays also increase the potential for parallelism and low-power modes. However, a large number of subarrays implies a more complex on-die network and more energy and delay within this network. It also entails greater overhead from peripheral circuitry (decoders, drivers, sense amps, etc.) per subarray, which directly impacts area and cost-per-bit. These are basic trade-offs considered during DRAM design and even incorporated into analytical cache models such as CACTI 6.5 [39, 49]. Figure 5 shows how a number of subarrays in a column share a row buffer that feeds the shared bus. The subarrays sharing a row buffer are referred to as a bank, and similar to the conventional model, a single bank can only be dealing with one access at a time. Our SSA implementation models hierarchical bitlines in which data read from a subarray are sent to the row buffer through second-level bitlines. To distribute load and maximize concurrency, data is interleaved such that consecutive cache lines are first placed in different DRAM chips and then in different banks of the same chip. To limit the impact on area and interconnect overheads, if we assume the same number of banks per DRAM chip as the baseline, we still end up with a much higher number of total banks on the DIMM. This is because in the baseline organization, the physical banks on all the chips are simply parts of larger logical banks. In the SSA design, each physical bank is independent and a much higher degree of concurrency is offered. Our analysis with a heavily extended version of CACTI 6.5 showed that the area overhead of SSA is only 4%.

Since subarray widths are only 64 bytes, sequential refresh at this granularity will be more time-consuming. However, it is fairly easy to refresh multiple banks simultaneously, i.e., they simply act as one large bank for refresh purposes. In addition, there exist simple techniques to perform refresh that keep the DRAM cell's access transistor on long enough to recharge the storage capacitor immediately after a destructive read, without involving the row-buffer [27].

Power-Down Modes: In the SSA architecture, a cache line request is serviced by a single bank in a single DRAM chip, and only a single subarray within that bank is activated. Since the activation "footprint" of the access is much smaller in the SSA design than in the baseline, there is the opportunity to power-down a large portion of the remaining area that may enjoy longer spells of inactivity. Datasheets from Micron [38] indicate that modern chips already support multiple power-down modes that disable various circuitry like the input and output buffers or even freeze the DLL. These modes do not destroy the data on the chip and the chip can be reactivated with a latency penalty proportional to the amount of circuitry that has been turned off and the depth of the power-down state. We adopt a simple strategy for power-down: if a subarray has been idle for I cycles, it goes into a power-down mode that consumes P times less background power than the active mode. When a request is later sent to this subarray, a W-cycle latency penalty is incurred for wake-up. The results section quantifies the performance and power impact for various values of I, P, and W.
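The idle-threshold policy above is simple enough to express directly; the sketch below walks a single subarray through a request trace and tallies the time spent powered down, the wake-up penalties incurred, and the resulting background-energy ratio. The values P = 5.5 and W = 3 match the "Power Down" mode discussed in Section 4.2.1; the request trace and the accounting loop itself are only an illustrative model.

```python
# Sketch of the SSA idle power-down policy for one subarray.
# I: idle-cycle threshold, P: background power reduction factor, W: wake-up penalty.
# P=5.5, W=3 correspond to the Power Down mode of Section 4.2.1; the loop is illustrative.

def simulate_subarray(request_times, total_cycles, I, P=5.5, W=3):
    asleep_cycles = 0
    wakeups = 0
    extra_latency = 0
    last_active = 0                      # cycle of the most recent access
    for t in sorted(request_times):
        idle = t - last_active
        if idle > I:                     # subarray slept from last_active + I until t
            asleep_cycles += idle - I
            wakeups += 1
            extra_latency += W           # this request pays the wake-up penalty
        last_active = t
    trailing_idle = total_cycles - last_active
    if trailing_idle > I:                # account for idle time after the last request
        asleep_cycles += trailing_idle - I
    awake_cycles = total_cycles - asleep_cycles
    background_energy_ratio = (awake_cycles + asleep_cycles / P) / total_cycles
    return asleep_cycles, wakeups, extra_latency, background_energy_ratio

print(simulate_subarray([100, 5000, 5030, 20000], total_cycles=30000, I=30))
```

A small I puts the subarray to sleep aggressively (large background savings, more wake-up penalties); a large I does the opposite, which is exactly the trade-off swept in Figures 9 and 10.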
Impact Summary: In summary, the proposed organization targets dynamic energy reduction by only activating a single chip and a single subarray (with short wordlines and exactly the required number of bitlines) when accessing a cache line. Area overhead is increased, compared to conventional DRAM, because each small subarray incurs the overhead of peripheral circuitry and because a slightly more complex on-die interconnect is required. Background energy can be reduced because a large fraction of the on-chip real estate is inactive at any point and can be placed in low-power modes. The interface between the memory controller and DRAM chips has been changed by effectively splitting the channel into multiple smaller-width channels. The impact on reliability is discussed in the next sub-section. Performance is impacted favorably by having many more banks per DIMM and higher concurrency. Similar to the baseline, if we assume that each chip has eight banks, the entire DIMM now has 64 banks. Performance may be impacted positively or negatively by adopting a close-page policy. Performance is negatively impacted because the cache line is returned to the memory controller via several serialized data transfers (an x8 part will take 64 transfers to return a 64-byte cache line). A negative impact is also incurred because the subarray access can begin only after the entire address is received.

We believe that SSA is superior to SBA, although it requires a larger re-design investment from the DRAM community. Firstly, in order to limit the area overhead of hierarchical wordlines, SBA is forced to fetch multiple cache lines, thus not completely eliminating overfetch. SSA therefore yields higher dynamic energy savings. By moving from large arrays in SBA to small subarrays in SSA, SSA also finds many more opportunities to place subarrays in low-power states and save leakage energy. In terms of performance, SSA is hurt by the long data transfer time, and will outdo SBA in workloads that have a high potential for bank-level concurrency.

3.3 Chipkill

Recent studies have shown that DRAMs are often plagued with errors and can lead to significant server downtime in datacenters [46]. Therefore, a low-power DRAM design targeted at datacenters must be amenable to an architecture that provides a high standard of reliability. A common expectation of business-critical server DRAM systems is that they are able to withstand a single DRAM chip failure. Just as an entire family of error-resilient schemes can be built for bit failures (for example, Single Error Correction Double Error Detection, SECDED), a family of error-resilient schemes can also be built for chip failure (for example, Single Chip error Correction Double Chip error Detection, SCCDCD), and these are referred to as Chipkill [18, 35]. We now focus on the design of an SCCDCD chipkill scheme; the technique can be easily generalized to produce stronger flavors of error-resilience.

First, consider a conventional design where each word (say 64 bits) has been appended with an 8-bit ECC code, to provide SECDED. For a chipkill scheme, each DRAM chip can only contribute one bit out of the 72-bit word. If a chip were to contribute any more, chip failure would mean multi-bit corruption within the 72-bit word, an error that a SECDED code cannot recover from. Therefore, each 72-bit word must be striped across 72 DRAM chips. When a 64-byte cache line is requested, 72 bytes are read out of the 72 DRAM chips, making sure that each 72-bit word obtains only a single bit from each DRAM chip. Such an organization was adopted in the Dell Poweredge 6400/6450 servers [35]. This provides some of the rationale for current DRAM systems that stripe a cache line across several DRAM chips. This is clearly energy inefficient as 72 DRAM chips are activated and a very small fraction of the read bits are returned to the CPU. It is possible to reduce the number of DRAM chips activated per access if we attach ECC codes to smaller words, as has been done in the IBM Netfinity systems [18].
This will have high storage overhead, but greater energy efficiency. For example, in a design attaching an ECC word to 8 bits, say, one may need five extra DRAM chips per eight DRAM chips on a single DIMM. ECC gets progressively more efficient as the granularity at which it is attached is increased.

In the SSA design, we intend to get the entire cache line from a single DRAM chip access. If this DRAM chip were to produce corrupted data, there must be a way to reconstruct it. This is a problem formulation almost exactly the same as that for reliable disks. We therefore adopt a solution very similar to that of the well-studied RAID [20] solution for disks, but that has never been previously employed within a DIMM. Note that some current server systems do employ RAID-like schemes across DIMMs [2, 6]; within a DIMM, conventional ECC with an extra DRAM chip is employed. These suffer from high energy overheads due to the large number of chips accessed on every read or write. Our approach is distinct and more energy-efficient. In an example RAID design, a single disk serves as the "parity" disk to eight other data disks. On a disk access (specifically in RAID-4 and RAID-5), only a single disk is read. A checksum associated with the read block (and stored with the data block on the disk) lets the RAID controller know if the read is correct or not. If there is an error, the RAID controller reconstructs the corrupted block by reading the other seven data disks and the parity disk. In the common error-free case, only one disk needs to be accessed because the checksum enables self-contained error detection. It is not fool-proof because the block + checksum may be corrupted, and the checksum may coincidentally be correct (the larger the checksum, the lower the probability of such a silent data corruption). Also, the parity overhead can be made arbitrarily low by having one parity disk for many data disks. This is still good enough for error detection and recovery because the checksum has already played the role of detecting and identifying the corrupted bits.
Figure 6: Chipkill support in SSA (only shown for 64 cache lines). Cache lines L0-L63 and their local checksums (C) are distributed across the DRAM devices on the DIMM, along with global parity blocks P0-P7.

The catch is that writes are more expensive, as every write requires a read of the old data block, a read of the old parity block, a write to the data block, and a write to the parity block. RAID-5 ensures that the parity blocks are distributed among all nine disks so that no one disk emerges as a write bottleneck.

We adopt the same RAID-5 approach in our DRAM SSA design (Fig. 6). The DRAM array microarchitecture must now be modified to not only accommodate a cache line, but also its associated checksum. We assume an eight-bit checksum, resulting in a storage overhead of 1.625% for a 64-byte cache line. The checksum function uses bit inversion so that stuck-at-zero faults do not go undetected. The checksum is returned to the CPU after the cache line return and the verification happens in the memory controller (a larger burst length is required, not additional DRAM pins). We cannot allow the verification to happen at the DRAM chip because a corrupted chip may simply flag all accesses as successfully passing the checksum test. The DIMM will now have one extra DRAM chip, a storage overhead of 12.5% for our evaluated platform. Most reads only require that one DRAM chip be accessed. A write requires that two DRAM chips be read and then written. This is the primary performance overhead of this scheme, as it increases bank contention (note that an increase in write latency does not impact performance because of read-bypassing at intermediate buffers at the memory controller). We quantify this effect in the results section. This also increases energy consumption, but it is still far less than the energy of reliable or non-reliable conventional DRAM systems.

Most chipkill-level reliability solutions have a higher storage overhead than our technique. As described above, the energy-efficient solutions can have as high as 62.5% overhead, the Dell Poweredge solution has a 12.5% overhead (but requires simultaneous access to 72 DRAM chips), and the rank-sub-setting DRAM model of Ahn et al. [8] has a 37.5% overhead. The key to our higher efficiency is the localization of an entire cache line to a single DRAM chip and the use of a checksum for self-contained error detection at modest overhead (1.625%) plus a parity chip (12.5% for 8-way parity). Even on writes, when four DRAM accesses are required, we touch fewer DRAM chips and read only a single cache line in each, compared to any of the prior solutions for chipkill [8, 18, 35]. Therefore, our proposed SSA architecture with chipkill functionality is better than other solutions in terms of area cost and energy.
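The read/write protocol just described maps naturally onto a small sketch. Below, a per-line checksum provides self-contained error detection on reads, and an XOR parity block across the eight data lines of a stripe allows a line from a failed chip to be rebuilt from the surviving chips. The checksum here is a simple inverted byte-sum stand-in; the paper only specifies an 8-bit checksum that uses bit inversion, so the exact function and all names are illustrative.

```python
# Illustrative RAID-5-style chipkill for SSA: 8 data lines per stripe + 1 parity block.
# The checksum is a stand-in; the paper fixes only its size (8 bits) and that it
# uses bit inversion so stuck-at-zero faults are caught.

LINE = 64  # bytes per cache line

def checksum(line: bytes) -> int:
    return (~sum(line)) & 0xFF          # 8-bit inverted sum (illustrative)

def parity_of(lines):
    p = bytearray(LINE)
    for line in lines:
        for i, b in enumerate(line):
            p[i] ^= b                   # XOR parity across the stripe
    return bytes(p)

def write(stripe, idx, new_line):
    # RAID-5 small write: read old data + old parity, write new data + new parity.
    old_line, old_parity = stripe[idx], stripe[-1]
    new_parity = bytes(a ^ b ^ c for a, b, c in zip(old_parity, old_line, new_line))
    stripe[idx], stripe[-1] = new_line, new_parity

def read(stripe, idx, stored_checksums):
    line = stripe[idx]
    if checksum(line) == stored_checksums[idx]:
        return line                     # common case: only one device touched
    # Checksum mismatch: rebuild from the other data lines and the parity block.
    survivors = [l for i, l in enumerate(stripe[:-1]) if i != idx]
    return parity_of(survivors + [stripe[-1]])

# Tiny usage example.
stripe = [bytes([i]) * LINE for i in range(8)]
stripe.append(parity_of(stripe))
sums = [checksum(l) for l in stripe[:-1]]
write(stripe, 3, bytes([0xAB]) * LINE); sums[3] = checksum(stripe[3])
stripe[3] = bytes(LINE)                 # simulate a failed chip returning zeros
assert read(stripe, 3, sums) == bytes([0xAB]) * LINE
```

In the actual SSA DIMM the parity lines rotate across the nine chips (Fig. 6) so no single chip becomes a write hotspot, and the checksum comparison happens at the memory controller rather than on the DRAM device, for the reason given above.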
Table 1: General parameters

Processor: 8-core OOO, 2 GHz
L1 cache: fully private, 3 cycles, 2-way, 32 KB each I and D
L2 cache: fully shared, 10 cycles, 8-way, 2 MB, 64 B cache lines
Row-buffer size: 8 KB
DRAM frequency: 400 MHz
DRAM part: 256 MB, x8
Chips per DIMM: 16
Channels: 1
Ranks: 2
Banks: 4
T-rcd, T-cas, T-rp: 5 DRAM cycles

4. RESULTS

4.1 Methodology

We model a baseline, 8-core, out-of-order processor with private L1 caches and a shared L2 cache. We assume a main memory capacity of 4 GB organized as shown in Table 1. Our simulation infrastructure uses Virtutech's SIMICS [5] full-system simulator, with out-of-order timing supported by Simics' 'ooo-micro-arch' module. The 'trans-staller' module was heavily modified to accurately capture DRAM device timing information including multiple channels, ranks, banks and open rows in each bank. Both open- and close-row page management policies with first-come-first-serve (FCFS) and first-ready-first-come-first-serve (FR-FCFS) scheduling with appropriate queuing delays are accurately modeled. We also model overlapped processing of commands by the memory controller to hide precharge and activation delays when possible. We also include accurate bus models for data transfer between the memory controller and the DIMMs. Address mapping policies were adopted from the DRAMSim [52] framework and from [27]. DRAM timing information was obtained from Micron datasheets [38].

Area, latency and energy numbers for DRAM banks were obtained from CACTI 6.5 [1], heavily modified to include accurate models for commodity DRAM, both for the baseline design and with hierarchical wordlines. By default, CACTI divides a large DRAM array into a number of mats with an H-tree to connect the mats. Such an organization incurs low latency but requires large area. However, traditional DRAM banks are heavily optimized for area to reduce cost and employ very large arrays with minimal peripheral circuitry overhead. Read or write operations are typically done using long multi-level hierarchical bitlines spanning the array instead of using an H-tree interconnect. We modified CACTI to reflect such a commodity DRAM implementation. Note that with a hierarchical bitline implementation, there is a potential opportunity to trade off bitline energy for area by only using hierarchical wordlines at the higher-level bitline and leaving the first-level bitlines untouched. In this work, we do not explore this trade-off. Instead, we focus on the maximum energy reduction possible. The DRAM energy parameters used in our evaluation are listed in Table 2.

We evaluate our proposals on subsets of the multi-threaded PARSEC [13], NAS [9] and STREAM [4] benchmark suites. We run every application for 2 million DRAM accesses (corresponding to many hundreds of millions of instructions) and report total energy consumption and IPC.
Table 2: Energy parameters

Component / Dynamic energy (nJ):
Decoder + wordline + sense amps, baseline: 1.429
Decoder + wordline + sense amps, SBA: 0.024
Decoder + wordline + sense amps, SSA: 0.013
Bitlines, baseline: 19.282
Bitlines, SBA/SSA: 0.151
Termination resistors, baseline/SBA/SSA: 7.323
Output drivers: 2.185
Global interconnect, baseline/SBA/SSA: 1.143

Low-power mode / Background power (mW):
Active: 104.5
Power Down (3 memory cycle wake-up): 19.0
Self Refresh (200 memory cycle wake-up): 10.8

4.2 Results

We first discuss the energy advantage of the SBA and SSA schemes. We then evaluate the performance characteristics and area overheads of the proposed schemes relative to the baseline organization.

4.2.1 Energy Characteristics

Figure 7 shows the energy consumption of the close-page baseline, SBA, and SSA, normalized to the open-page baseline. The close-page baseline is clearly worse in terms of energy consumption than the open-page baseline simply due to the fact that even accesses that were potentially row-buffer hits (thus not incurring the energy of activating the entire row again) now need to go through the entire activate-read-precharge cycle. We see an average increase in energy consumption of 73%, with individual benchmark behavior varying based on their respective row-buffer hit rates. We see from Figure 8 (an average across all benchmarks) that in the baseline organizations (both open and close row), the total energy consumption in the device is dominated by energy in the bitlines. This is because every access to a new row results in a large number of bitlines getting activated twice, once to read data out of the cells into the row-buffer and once to precharge the array.

Moving to the SBA or SSA schemes eliminates a huge portion of this energy component. By waiting for the CAS signal and only activating/precharging the exact cache line that we need, bitline energy goes down by a factor of 128. This results in a dramatic energy reduction on every access. However, as discussed previously, prohibitive area overheads necessitate coarser grained selection in SBA, leading to slightly larger energy consumption compared to SSA. Compared to a baseline open-page system, we see average dynamic memory energy savings of 3X in SBA and over 6.4X in SSA.

Figure 7: DRAM dynamic energy consumption (relative to the open-page baseline) for the open-row baseline, close-row baseline, SBA, and SSA.

Figure 8: Contributors to DRAM dynamic energy (termination resistors, global interconnect, bitlines, decoder/wordline/sense amps) for the open-page baseline (FR-FCFS), closed-row baseline (FCFS), SBA, and SSA.

Note that the proposed optimizations result in energy reduction only in the bitlines. The energy overhead due to other components such as decoder, pre-decoder, inter-bank bus, bus termination, etc. remains the same. Hence, their contribution to the total energy increases as bitline energy goes down. Localizing and managing DRAM accesses at a granularity as fine as a subarray allows more opportunity to put larger parts of the DRAM into low-power states. Current DRAM devices support multiple levels of power-down, with different levels of circuitry being turned off, and correspondingly larger wake-up penalties.
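The per-access dynamic energy components in Table 2 already explain the shape of Figures 7 and 8. The quick calculation below (with the grouping into a single per-access sum as our only assumption; it is not the paper's benchmark-level savings figure) shows the bitline component collapsing from 19.282 nJ to 0.151 nJ, roughly the factor of 128 cited above, while termination, output drivers, and global interconnect stay fixed and therefore dominate the remaining energy.

```python
# Dynamic energy components per access (nJ), from Table 2; summing them into a
# single per-access figure is our simplification, not the paper's reported result.
components = {
    "decoder+wordline+senseamps": {"baseline": 1.429, "SBA": 0.024, "SSA": 0.013},
    "bitlines":                   {"baseline": 19.282, "SBA": 0.151, "SSA": 0.151},
    "termination resistors":      {"baseline": 7.323, "SBA": 7.323, "SSA": 7.323},
    "output drivers":             {"baseline": 2.185, "SBA": 2.185, "SSA": 2.185},
    "global interconnect":        {"baseline": 1.143, "SBA": 1.143, "SSA": 1.143},
}

print("bitline reduction factor:",
      round(components["bitlines"]["baseline"] / components["bitlines"]["SSA"]))
for scheme in ("baseline", "SBA", "SSA"):
    total = sum(c[scheme] for c in components.values())
    share = components["bitlines"][scheme] / total
    print(f"{scheme:8s}: bitlines are {share:.0%} of the listed dynamic components")
# -> factor ~128; bitlines dominate the baseline but become negligible for SBA/SSA,
#    which is why the fixed components grow in relative share in Figure 8.
```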
We evaluate two simple low-power modes with P (power savings factor) and W (wake-up) values calculated based on numbers shown in Table 2, obtained from the Micron datasheet and power system calculator [3, 38]. In the deepest sleep mode, Self Refresh, P is 10 and W is 200 memory cycles. A less deep sleep mode is Power Down, where P is 5.5, but W is just 3 memory cycles. We vary I (idle cycle threshold) as multiples of the wake-up time W. Figures 9 and 10 show the impact of these low-power states on performance and energy consumption in the SSA organization. We see that the more expensive Self Refresh low-power mode actually buys us much lower energy savings compared to the more efficient Power Down mode. As we become less aggressive in transitioning to low-power states (increase I), the average memory latency penalty goes down, from just over 5% to just over 2% for the Power Down mode. The percentage of time we can put subarrays in low-power mode correspondingly changes from almost 99% to about 86%, with energy savings between 81% and 70%. The performance impacts are much larger for the expensive Self Refresh mode, going from over 400% at a very aggressive I to under 20% in the least aggressive case. Correspondingly, banks can be put in this state between 95% and 20% of the time, with energy savings ranging from 85% to 20%. Naturally, these power-down modes can be applied to the baseline architecture as well. However, the granularity at which this can be done is much coarser, a DIMM bank at best. This means that there are fewer opportunities to move into low-power states. As a comparison, we study the application of the low-overhead Power Down state to the baseline. We find that on average, even with an aggressive sleep threshold, banks can only be put in this mode about 80% of the time, while incurring a penalty of 16% in terms of added memory latency. Being less aggressive dramatically impacts the ability to power down the baseline, with banks going into sleep mode only 17% of the time with a minimal 3% latency penalty. As another comparison point, we consider the percentage of time subarrays or banks can be put in the deepest sleep Self Refresh mode in SSA vs. the baseline, for a constant 10% latency overhead. We find that subarrays in SSA can go into deep sleep nearly 18% of the time whereas banks in the baseline can only go into deep sleep about 5% of the time.

Figure 9: Memory latency impact of using low-power states (percentage increase in memory latency vs. idle threshold value, as a multiple of wake-up time, for Self Refresh and Power Down).

Figure 10: Energy reduction using low-power states (percentage reduction in background energy vs. idle threshold value, as a multiple of wake-up time, for Self Refresh and Power Down).

4.2.2 Performance Characteristics

Employing either the SBA or SSA schemes impacts memory access latency (positively or negatively) as shown in Figure 11. Figure 12 then breaks this latency down into the average contributions of the various components. One of the primary factors affecting this latency is the page management policy. Moving to a close-page policy from an open-page baseline actually results in a drop in average memory latency of about 17% for a majority (10 of 12) of our benchmarks. This has favorable implications for SBA and SSA, which must use a close-page policy. The remaining benchmarks see an increase in memory latency of about 28% on average when moving to close-page. Employing the Posted-RAS scheme in the SBA model causes an additional small latency of just over 10% on average (neglecting two outliers). As seen in Figure 12, for these three models, the queuing delay is the dominant contributor to total memory access latency. Prior work [15] has also shown this to be true in many DRAM systems.
Figure 11: Average main memory latency (cycles) for the open-page baseline, close-page baseline, SBA, and SSA.

Figure 12: Contributors to total memory latency (queuing delay, command/address transfer, rank switching delay (ODT), DRAM core access, and data transfer) for the open-page baseline (FR-FCFS), closed-row baseline (FCFS), SBA, and SSA.

We therefore see that the additional latency introduced by the Posted-RAS does not significantly change average memory access latency. The SSA scheme, however, has an entirely different bottleneck. Every cache line return is now serialized over just 8 links to the memory controller. This data transfer delay now becomes the dominant factor in the total access time. However, this is offset to some extent by a large increase in parallelism in the system. Each of the 8 devices can now be servicing independent sets of requests, significantly reducing the queuing delay. As a result, we do not see a greatly increased memory latency. On half of our benchmarks, we see latency increases of just under 40%. The other benchmarks are actually able to exploit the parallelism much better, and this more than compensates for the serialization latency, with average access time going down by about 30%. These are also the applications with the highest memory latencies. As a result, overall, SSA in fact outperforms all other models.

Figure 13 shows the relative IPCs of the various schemes under consideration. Like we saw for the memory latency numbers, a majority of our benchmarks perform better with a close-row policy than with an open-row policy. We see performance improvements of just under 10% on average (neglecting two outliers) for 9 of our 12 benchmarks. The other three suffered degradations of about 26% on average. These were the benchmarks with relatively higher last-level cache miss rates (on the order of 10 every 1000 instructions). Employing the Posted RAS results in a marginal IPC degradation over the close-row baseline, about 4% on average, neglecting two outlier benchmarks.

Figure 13: Normalized IPCs of various organizations (open-page baseline, close-page baseline, SBA, SSA, and SSA with chipkill).

The SSA scheme sees a performance degradation of 13% on average compared to the open-page baseline on the six benchmarks that saw a memory latency increase. The other 6 benchmarks with a decreased memory access latency see performance gains of 54% on average. These high numbers are observed because these applications are clearly limited by bank contention and SSA addresses this bottleneck. To summarize, in addition to significantly lowered DRAM access energies, SSA occasionally can boost performance, while yielding minor performance slowdowns for others. We expect SSA to yield even higher improvements in the future as ever more cores exert higher queuing pressures on memory controllers. Figure 13 shows the IPC degradation caused when we augment SSA with our chipkill solution. Note that this is entirely because of the increased bank contention during writes. On average, the increase in memory latency is a little over 70%, resulting in a 12% degradation in IPC. Compared to the non-chipkill SSA, there is also additional energy consumption on every write, resulting in a 2.2X increase in dynamic energy to provide chipkill-level reliability, which is still significantly lower than a baseline organization.

4.2.3 System Level Characteristics

To evaluate the system level impact of our schemes, we use a simple model where the DRAM subsystem consumes 40% of total system power (32% dynamic and 8% background). Changes in performance are assumed to linearly impact the power consumption in the rest of the system, both background and dynamic. Having taken these into account, on average, we see 18% and 36% reductions in system power with SBA and SSA respectively.
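The system-level numbers follow directly from the simple power model just stated. The sketch below reproduces that arithmetic; the 40% / 32% / 8% split is from the paper, while the reduction factors and runtime term passed in are placeholders for the per-scheme simulation results, and every name in the function signature is our own.

```python
# System power model from Section 4.2.3: DRAM is 40% of system power
# (32% dynamic + 8% background); the remaining 60% scales linearly with runtime.
# The example reduction factors below are placeholders, not the paper's inputs.

def system_power(dram_dynamic_factor, dram_background_factor, runtime_factor=1.0):
    """Relative system power after applying a scheme.

    dram_dynamic_factor / dram_background_factor: new/old DRAM energy ratios.
    runtime_factor: new/old execution time (performance changes scale the
    non-DRAM 60% of system power linearly, per the paper's model).
    """
    dram = 0.32 * dram_dynamic_factor + 0.08 * dram_background_factor
    rest = 0.60 * runtime_factor
    return dram + rest

# e.g. a scheme that cuts DRAM dynamic energy 6x and background energy 5x at
# unchanged runtime would leave roughly 67% of the original system power:
print(f"{system_power(1/6, 1/5):.2f}")
```

The paper's reported 18% and 36% reductions for SBA and SSA fold in the measured per-benchmark energy and performance changes rather than these round placeholder factors.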
5. RELATED WORK

The significant contribution of DRAM to overall system power consumption has been documented in several studies [10, 32, 37]. A majority of techniques aimed at conserving DRAM energy try to transition inactive DRAM chips to low power states [30] as effectively as possible to decrease the background power. Researchers have investigated prediction models for DRAM activity [19], adaptive memory controller policies [23], compiler-directed hardware-assisted data layout [16], management of DMA and CPU generated request streams to increase DRAM idle periods [22, 43], as well as managing the virtual memory footprint and physical memory allocation schemes [14, 17, 21] to transition idle DRAM devices to low power modes.

The theme of the other major volume of work aimed at DRAM power reduction has involved rank-subsetting. In addition to exploiting low-power states, these techniques attempt to reduce the dynamic energy component of an access. Zheng et al. suggest the subdivision of a conventional DRAM rank into mini-ranks [56] comprising a subset of DRAM devices. Ahn et al. [7, 8] propose a scheme where each DRAM device can be controlled individually via a demux register per channel that is responsible for routing all command signals to the appropriate chip. In their multicore DIMM proposal, multiple DRAM devices on a DIMM can be combined to form a Virtual Memory Device (VMD) and a cache line is supplied by one such VMD. They further extend their work with a comprehensive analytical model to estimate the implications of rank-subsetting on performance and power. They also identify the need to have mechanisms that would ensure chipkill level reliability and extend their designs with SCCDCD mechanisms. A similar approach was proposed by Ware et al., employing high-speed signals to send chip selects separately to parts of a DIMM in order to achieve dual/quad threaded DIMMs [53]. On the other hand, Sudan et al. [47] attempt to improve row-buffer utilization by packing heavily used cache lines into "Micro-Pages".

Other DRAM-related work includes design for 3D architectures (Loh [36]), and design for systems with photonic interconnects (Vantrease et al. [51] and Beamer et al. [12]). Yoon and Erez [55] outline efficient chipkill-level reliability mechanisms for DRAM systems but work with existing microarchitectures and data layouts.

However, to the best of our knowledge, our work is the first to attempt fundamental microarchitectural changes to the DRAM system specifically targeting reduced energy consumption. Our SBA mechanism with Posted-RAS is a novel way to reduce activation and can eliminate overfetch. The SSA mechanism re-organizes the layout of a DRAM chip to support small subarrays and the mapping of data to only activate a single subarray. Our chipkill solution that uses checksum-based detection and RAID-like correction has not been previously considered and is more effective than those used for prior DRAM chipkill solutions [8, 18, 35].

6. CONCLUSIONS

We propose two novel techniques to eliminate overfetch in DRAM systems by activating only the necessary bitlines (SBA) and then going as far as to isolate an entire cache line to a single small subarray on a single DRAM chip (SSA). Our solutions will require non-trivial initial design effort on the part of DRAM vendors and will incur minor area/cost increases.
A similar architecture will likely also be suitable for emerging memory technologies such as PCM and STT-RAM. The memory energy reductions from our techniques are substantial for both dynamic (6X) and background (5X) components. We observe that fetching exactly a cache line with SSA can improve performance in some cases (over 50% on average) due to its close-page policy and also because it helps alleviate bank contention in some memory-sensitive applications. In other applications that are not as constrained by bank contention, the SSA policy can cause performance degradations (13% on average) because of long cache line data transfer times out of a single DRAM chip. Any approach that reduces the number of chips used to store a cache line also increases the probability of correlated errors. With SSA, we read an entire cache line out of a single DRAM array, so the potential for correlated errors is increased. In order to provide chipkill-level reliability in concert with SSA, we introduced checksums stored for each cache line in the DRAM, similar to that provided in hard drives. Using the checksum we can provide robust error detection capabilities, and provide chipkill-level reliability through RAID techniques (however in our case, we use a Redundant Array of Inexpensive DRAMs). We show that this approach is more effective in terms of area and energy than prior chipkill approaches, and only incurs a 12% performance penalty compared to an SSA memory system without chipkill.

7. ACKNOWLEDGMENTS

This work was supported in parts by NSF grants CCF-0430063, CCF-0811249, CCF-0916436, NSF CAREER award CCF-0545959, SRC grant 1847.001, and the University of Utah. The authors would also like to thank Utah Arch group members Kshitij Sudan, Manu Awasthi, and David Nellans for help with the baseline DRAM simulator.

8. REFERENCES

[1] CACTI: An Integrated Cache and Memory Access Time, Cycle Time, Area, Leakage, and Dynamic Power Model. http://www.hpl.hp.com/research/cacti/.
[2] HP Advanced Memory Protection Technologies - Technology Brief. http://www.hp.com.
[3] Micron System Power Calculator. http://www.micron.com/support/partinfo/powercalc.
[4] STREAM - Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.
[5] Virtutech Simics Full System Simulator. http://www.virtutech.com.
[6] M. Abbott et al. Durable Memory RS/6000 System Design. In Proceedings of the International Symposium on Fault-Tolerant Computing, 1994.
[7] J. Ahn, J. Leverich, R. S. Schreiber, and N. Jouppi. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs. IEEE Computer Architecture Letters, vol. 7(1), 2008.
[8] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber. Future Scaling of Processor-Memory Interfaces. In Proceedings of SC, 2009.
[9] D. Bailey et al. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5(3):63-73, Fall 1991.
[10] L. Barroso. The Price of Performance. Queue, 3(7):48-53, 2005.
[11] L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, 2009.
[12] S. Beamer et al. Re-Architecting DRAM Memory Systems with Monolithically Integrated Silicon Photonics. In Proceedings of ISCA, 2010.
[13] C. Benia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Department of Computer Science, Princeton University, 2008.
[14] P. Burns et al. Dynamic Tracking of Page Miss Ratio Curve for Memory Management. In Proceedings of ASPLOS, 2004.
[15] V. Cuppu and B. Jacob. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance. In Proceedings of ISCA, 2001.
[16] V. Delaluz et al. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In Proceedings of HPCA, 2001.
[17] V. Delaluz et al. Scheduler-based DRAM Energy Management. In Proceedings of DAC, 2002.
[18] T. J. Dell. A Whitepaper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. Technical report, IBM Microelectronics Division, 1997.
[19] X. Fan, H. Zeng, and C. Ellis. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, 2001.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 4th edition, 2007.
[21] H. Huang, P. Pillai, and K. G. Shin. Design and Implementation of Power-Aware Virtual Memory. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2003.
[22] H. Huang, K. Shin, C. Lefurgy, and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In Proceedings of ISLPED, 2005.
[23] I. Hur and C. Lin. A Comprehensive Approach to DRAM Power Management. In Proceedings of HPCA, 2008.
[24] E. Ipek, O. Mutlu, J. Martinez, and R. Caruana. Self Optimizing Memory Controllers: A Reinforcement Learning Approach. In Proceedings of ISCA, 2008.
[25] K. Itoh. VLSI Memory Chip Design. Springer, 2001.
[26] ITRS. International Technology Roadmap for Semiconductors, 2007 Edition. http://www.itrs.net/Links/2007ITRS/Home2007.htm.
[27] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems - Cache, DRAM, Disk. Elsevier, 2008.
[28] M. Kumanoya et al. An Optimized Design for High-Performance Megabit DRAMs. Electronics and Communications in Japan, 72(8), 2007.
[29] O. La. SDRAM having posted CAS function of JEDEC standard, 2002. United States Patent, Number 6483769.
[30] A. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power Aware Page Allocation. In Proceedings of ASPLOS, 2000.
[31] C. Lee, O. Mutlu, V. Narasiman, and Y. Patt. Prefetch-Aware DRAM Controllers. In Proceedings of MICRO, 2008.
[32] C. Lefurgy et al. Energy Management for Commercial Servers. IEEE Computer, 36(2):39-48, 2003.
[33] K. Lim et al. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In Proceedings of ISCA, 2008.
[34] K. Lim et al. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of ISCA, 2009.
[35] D. Locklear. Chipkill Correct Memory Architecture. Technical report, Dell, 2000.
[36] G. Loh. 3D-Stacked Memory Architectures for Multi-Core Processors. In Proceedings of ISCA, 2008.
[37] D. Meisner, B. Gold, and T. Wenisch. PowerNap: Eliminating Server Idle Power. In Proceedings of ASPLOS, 2009.
[38] Micron Technology Inc. Micron DDR2 SDRAM Part MT47H256M8, 2006.
[39] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of MICRO, 2007.
[40] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007.
[41] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008.
[42] U. Nawathe et al. An 8-Core 64-Thread 64b Power-Efficient SPARC SoC. In Proceedings of ISSCC, 2007.
[43] V. Pandey, W. Jiang, Y. Zhou, and R. Bianchini. DMA-Aware Memory Energy Management. In Proceedings of HPCA, 2006.
[44] B. Rogers et al. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling. In Proceedings of ISCA, 2009.
[45] V. Romanchenko. Quad-Core Opteron: Architecture and Roadmaps. http://www.digital-daily.com/cpu/quad core opteron.
[46] B. Schroeder, E. Pinheiro, and W. Weber. DRAM Errors in the Wild: A Large-Scale Field Study. In Proceedings of SIGMETRICS, 2009.
[47] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In Proceedings of ASPLOS-XV, 2010.
[48] R. Swinburne. Intel Core i7 - Nehalem Architecture Dive. http://www.bit-tech.net/hardware/2008/11/03/intel-core-i7-nehalem-architecture-dive/.
[49] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories, 2007.
[50] U.S. Environmental Protection Agency - Energy Star Program. Report to Congress on Server and Data Center Energy Efficiency - Public Law 109-431, 2007.
[51] D. Vantrease et al. Corona: System Implications of Emerging Nanophotonic Technology. In Proceedings of ISCA, 2008.
[52] D. Wang et al. DRAMsim: A Memory-System Simulator. In SIGARCH Computer Architecture News, volume 33, September 2005.
[53] F. A. Ware and C. Hampel. Improving Power and Data Efficiency with Threaded Memory Modules. In Proceedings of ICCD, 2006.
[54] D. Wentzlaff et al. On-Chip Interconnection Architecture of the Tile Processor. In IEEE Micro, volume 22, 2007.
[55] D. Yoon and M. Erez. Virtualized and Flexible ECC for Main Memory. In Proceedings of ASPLOS, 2010.
[56] H. Zheng et al. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. In Proceedings of MICRO, 2008.