A Landscape of the New Dark Silicon Design Regime

Uploaded On 2014-11-30


slope but slow switching times. Both TFETs and NEMS devices thus hint at orders-of-magnitude improvements in leakage but remain untamed and fall short of being integrated into real chips.

Realizing the importance of the fourth horseman, a recent $194 million DARPA/MARCO STARnet program is funding four centers, each focusing on a key direction for beyond-CMOS approaches: developing electron spin-based memory computation devices (C-SPIN), formulating new information-processing models that can leverage statistical (that is, nondeterministic)

Figure 2. The GreenDroid architecture, an example of a coprocessor-dominated architecture (CoDA). The GreenDroid Mobile Application Processor comprises 16 nonidentical tiles (a). Each tile (b) holds components common to every tile (the CPU, on-chip network (OCN), and shared level-1 (L1) data cache) and provides space for multiple conservation cores, or c-cores, of various sizes. A variety of in-tile networks (c) connect components and c-cores.

increases in design, verification, and programming effort for these CoDAs. Combating the Tower of Babel problem requires defining a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and programmer from such systems' underlying complexity.

Overcoming Amdahl-imposed limits on specialization. Amdahl's law provides an additional roadblock for specialization. To save energy across the majority of the computation, we must find broad-based specialization approaches that apply to both regular, parallel code and irregular code. We must also ensure that communication among specialized processors doesn't fritter away their energy savings on costly cross-chip communication or shared-memory accesses.

Recent efforts. The UCSD GreenDroid processor (see Figure 2) [3, 15] is one such CoDA-based system that seeks to address both complexity issues and Amdahl limits. GreenDroid is a mobile application processor that implements Android mobile environment hotspots using hundreds of specialized cores called conservation cores, or c-cores [1, 9]. C-cores, which target both irregular and regular code, are automatically generated from C or C++ source code, and support a patching mechanism that lets them track software changes. They attain an estimated 8× to 10× energy-efficiency improvement, at no loss in serial performance, even on nonparallel code, and without any user or programmer intervention. Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads in which execution time is loosely concentrated, NTV processors might hold an area advantage because of their reconfigurability.

Other specialized processors such as the University of Wisconsin-Madison's DySER and the University of Michigan's Beret propose alternative architectures that exploit specialization like c-cores, but focus on improving reconfigurability at the cost of energy savings. Recent efforts have also examined the use of approximate neural-network-based computing as an elegant way to package programmability, reconfigurability, and specialization.

The "deus ex machina" horseman

Of the four horsemen, this is by far the most unpredictable. "Deus ex machina" refers to a plot device in literature or theater in which the protagonists seem increasingly doomed until the very last moment, when something completely unexpected comes out of nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in semiconductor devices. However, as we shall see, the breakthroughs that would be required would have to be quite fundamental; in fact, we most likely would have to build transistors out of devices other than MOSFETs. Why? Because MOSFET leakage is set by fundamental principles of device physics, and is limited to a subthreshold slope of 60 mV/decade at room temperature; this corresponds to a 10× reduction in leakage current for every 60 mV that the threshold voltage is above the $V_{ss}$, which is determined by properties of thermionic emission of carriers across a potential well. Thus, although innovations such as Intel's FinFET/TriGate transistor and high-k dielectrics represent significant achievements in maintaining a subthreshold slope close to their historical values, they still remain within the scope of the MOSFET-imposed limits and are one-time improvements rather than scalable changes.

Two VLSI candidates that bypass these limits because they are not based on thermal injection are tunnel field-effect transistors (TFETs), which are based on tunneling effects, and nanoelectromechanical system (NEMS) switches, which are based on physical relays. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade (twice as good as the ideal MOSFET) but with lower on-currents than MOSFETs, limiting their use in high-performance circuits. NEMS devices have essentially a near-zero subthreshold

IEEE Micro

subset of cache transistors (such as a wordline) is accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, a level-1 (L1) cache clocked at its maximum frequency can be about 10× darker per square millimeter, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square millimeter. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4× to 2× more cache per core per generation. However, many applications do not benefit much from additional cache, and upcoming TSV-integrated DRAM will reduce the cache benefit for those applications that do.

Computational sprinting and Turbo Boost. Other techniques employ "temporal dimness" as opposed to "spatial dimness," temporarily exceeding the nominal thermal budget but relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance up until the processor reaches nominal temperature, relying on the heat sink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally, about 10 seconds),
then switches over to four lower-energy, lower-performance A7 cores. Computational sprinting carries this a step further, employing phase-change materials that let chips exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race to finish" computations, such as webpage rendering, for which response latency is important, or for which speeding up the transition of both the processor and its support logic to a low-power state reduces energy consumption.

The specialized horseman

The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy efficient (100× to 1,000×) than a general-purpose processor. Execution hops between coprocessors and general-purpose cores, executing where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy. Unlike dim silicon, which tends to focus on manipulating voltages, frequencies, and duty cycles as ways to manage power, specialized logic focuses on reducing the amount of capacitance that needs to be switched to perform a particular operation.

The promise for a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals have extrapolated this trend and anticipate that the near future will see systems comprising more coprocessors than general-purpose processors [1, 7]. This article refers to these systems as coprocessor-dominated architectures, or CoDAs.

As specialization usage grows to combat the dark silicon problem, we are faced with a modern-day specialization "Tower of Babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication between programmers and software and the underlying hardware. Already, we see the deployment of specialized languages such as CUDA that are not usable between similar architectures (for example, AMD and Nvidia). We see overspecialization problems between accelerators that cause them to become inapplicable to closely related classes of computations (such as double-precision scientific codes running incorrectly on a GPU's non-IEEE-compliant floating-point hardware). Adoption problems are also caused by the excessive costs of programming heterogeneous hardware (such as the slow uptake of Sony PlayStation 3 versus Xbox). Moreover, specialized hardware risks obsolescence as standards are revised (for example, a JPEG standard revision).

Insulating humans from complexity. These factors speak to potential exponential

digitally. However, analog techniques might not scale well to deep nanometer technology.

Fast, static, "gather, reduce, and broadcast" operators. Neurons have fanout and fan-in of approximately 7,000 to other neurons that are located significant distances away. Effectively, they can perform efficient operations that combine vector-style gather memory accesses to large numbers of static-memory locations, with a vector-style reduction operator and a broadcast. Do more efficient ways exist for implementing these operations in silicon? They could be useful for computations that operate on finite-sized static graphs.

Recently, both the EU and US governments have proposed initiatives to enable greater studies of the computational capabilities of the brain. Although brain-inspired computing has already come and gone several times in the brief history of manmade computers, dark silicon may cause these approaches to become increasingly relevant.

Although silicon is getting darker, for researchers the future is bright and exciting. Dark silicon will cause a transformation of the computational stack and provide many opportunities for investigation.

Acknowledgments

This work was partially supported by NSF awards 0846152, 1018850, 0811794, and 1228992, Nokia and AMD gifts, and by STARnet, an SRC program sponsored by MARCO and DARPA. I thank the anonymous reviewers for their valuable insights and suggestions.

References

1. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Architectural Support for Programming Languages and Operating Systems Conf., ACM, 2010, pp. 205-218.
2. R. Merrit, "ARM CTO: Power Surge Could Create 'Dark Silicon,'" EE Times, 22 Oct.
3. N. Goulding et al., "GreenDroid: A Mobile Application Processor for a Future of Dark Silicon," Hot Chips Symp., 2010.
4. M. Taylor, "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," Proc. 49th Ann. Design Automation Conf. (DAC 12), ACM, 2012, pp. 1131-1136.
5. R.H. Dennard, "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. 256-268.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ACM SIGARCH Computer Architecture News, vol. 39, no. 3, 2011, pp. 365-376.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. W. Huang et al., "Scaling with Design Constraints: Predicting the Future of Big Chips," IEEE Micro, vol. 31, no. 4, 2011, pp. 16-29.
9. J. Sampson et al., "Efficient Complex Operators for Irregular Codes," Proc. 17th Int'l Symp. High Performance Computer Architecture (HPCA 11), IEEE CS, 2011, pp. 491-502.
10. A. Raghavan et al., "Computational Sprinting," Proc. IEEE 18th Int'l Symp. High-Performance Computer Architecture (HPCA 12), IEEE CS, 2012, doi:10.1109/HPCA.2012.6169031.
11. R. Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," Proc. IEEE, vol. 98, no. 2, 2010, pp. 253-266.
12. E. Krimer et al., "Synctium: A Near-Threshold Stream Processor for Energy-Constrained Parallel Applications," IEEE Computer Architecture Letters, Jan. 2010, pp. 21-24.
13. D. Fick et al., "Centip3de: A 3930 DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 190-192.
14. S. Jain et al., "A 280 mV-to-1.2 V Wide-Operating-Range IA-32 Processor in 32 nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 66-68.
15. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86-95.
silicon that is not used all the time, or at its full frequency. Even during the best days of CMOS scaling, microprocessor and other circuits were chock full of "dark logic" used infrequently or for only some applications. For instance, caches are inherently dark because the average cache transistor is switched for far less than one percent of cycles, and FPUs remain dark in integer codes.

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits toward swaths of low-duty-cycle logic that exists, not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement because it frees up more of the fixed power budget to be used for even more computation.

The four horsemen

Recently, researchers proposed a taxonomy, the four horsemen, that identifies four directions for dealing with dark silicon that have emerged as promising approaches as we transition beyond the initial multicore stop-gap solution. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering

Figure 1. Multicore scaling leads to large amounts of dark silicon. Across two process generations (65 nm to 32 nm), there is a spectrum of trade-offs between frequency and core count; these include increasing core count by 2× but leaving frequency constant (top), and increasing frequency by 2× but leaving core count constant (bottom). Any of these trade-off points will have large amounts of dark silicon.

Is Dark Silicon Real? A Reality Check

A quick survey of recent designs from multicore outfits such as Tilera, Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz, and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has six superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74 percent versus the 93.75 percent predicted by the utilization wall. The latest 2012 International Technology Roadmap for Semiconductors also shows that scaling has proceeded consistently with post-Dennard predictions.
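The sidebar's arithmetic can be reproduced with a short sketch. The function names here are illustrative and do not come from the article; the sketch assumes ideal Dennard scaling, in which frequency grows by the feature-size ratio S and core count by S squared, and it uses the article's rounded figures (5.7× and 69×) for the darkness ratio.

```python
def dennard_projection(old_nm, new_nm, old_ghz, old_cores=1):
    """Project frequency and core count under ideal Dennard scaling."""
    s = old_nm / new_nm  # linear scaling factor S between the two processes
    return {
        "s": s,
        "freq_ghz": old_ghz * s,       # frequency scales by S
        "cores": old_cores * s ** 2,   # transistor (core) count scales by S^2
    }


def darkness_ratio(actual_gain, ideal_gain):
    """Fraction of the ideal throughput gain that was not realized."""
    return 1.0 - actual_gain / ideal_gain


if __name__ == "__main__":
    proj = dennard_projection(old_nm=90, new_nm=22, old_ghz=3.8)
    print(f"Dennard projection: {proj['freq_ghz']:.1f} GHz, {proj['cores']:.0f} cores")

    # Observed part: six cores at 3.6 GHz versus one core at 3.8 GHz.
    actual_gain = (3.6 * 6) / (3.8 * 1)
    print(f"Observed throughput gain: {actual_gain:.1f}x")

    # Darkness using the article's rounded figures.
    print(f"Observed darkness: {darkness_ratio(5.7, 69):.2%}")

    # Utilization wall: a 2x shortfall per generation over four generations
    # (90 -> 65 -> 45 -> 32 -> 22 nm).
    print(f"Predicted darkness: {1 - 0.5 ** 4:.2%}")
```

Computing the ideal gain directly as S cubed gives a slightly smaller number than 69× because the article rounds the projected core count up to 17 before multiplying.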
architecture is strategically managing the chip-wide transistor duty cycle to enforce the overall power constraint [8, 9]. Whereas early 90-nm designs such as Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly more elegant methods that make better trade-offs.

Dim silicon techniques include dynamically varying the frequency with the number of cores being used, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and computational sprinting.

Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant of the number of currently active cores, Intel's Turbo Boost 1.0 enabled second-generation multicores to make real-time trade-offs between active core count and the frequency the cores ran at: the fewer the cores, the higher the frequency. When Turbo Boost is enabled, it uses the energy gained from turning off cores to increase the voltage and then the frequency of the active cores. This technique, known as dynamic voltage and frequency scaling (DVFS), increases power proportional to the cube of the increase in frequency.

NTV processors. In the past, DVFS was also used to save cubic power when frequencies were decreased. However, today, processor manufacturers operate transistors at reduced voltages, around 2.5× the threshold voltage, an energy-delay optimal point. This point is right at the edge of an operating regime where frequency starts to drop precipitously as voltage is reduced, which makes downward DVFS much less effective.

Nonetheless, researchers have begun to explore this regime. One recent approach is near-threshold voltage (NTV) logic, which operates transistors in the near-threshold regime slightly above the threshold voltage, providing more palatable trade-offs between energy and delay than subthreshold circuits, for which frequency drops exponentially with voltage decreases. Researchers have explored wide-SIMD NTV processors, which seek to exploit data parallelism, along with NTV many-core processors and an NTV x86 processor.

Although NTV per-processor performance drops faster than the corresponding savings in energy per instruction (5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel if the workload allows it. Then, an additional 5× processors could turn the energy-efficiency gains into additional performance. So, with ideal parallelization, NTV could offer 5× the throughput by absorbing 40× the area. But this would also require 40× more free parallelism in the workload relative to the parallelism consumed by an equivalent energy-limited super-threshold many-core processor.

In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications that have this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments because the energy-limited baseline super-threshold design has consumed less of the available parallelism. Furthermore, NTV clearly becomes more applicable for workloads with extremely large amounts of parallelism.

NTV presents several circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Because NTV designs can expand the area consumption by approximately 8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating-voltage static RAMs (SRAMs) and the increased interconnection energy consumption due to greater spreading across cores.

Bigger caches. An often-proposed dim-silicon alternative is to simply allocate otherwise dark silicon area for caches. Because only a
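The DVFS and NTV arithmetic from the dim-horseman discussion can be sketched as follows. This is a first-order illustration rather than a circuit model: the function names are made up for this example, DVFS is assumed to follow the cubic rule stated in the text, and each NTV core is assumed to run at 1/8 the speed while spending 1/5 the energy per instruction.

```python
def dvfs_power_ratio(freq_ratio):
    """First-order DVFS rule from the text: power scales with the cube of frequency."""
    return freq_ratio ** 3


def ntv_tradeoff(energy_gain=5.0, perf_cost=8.0):
    """Return (area_factor, throughput_factor, power_factor) for an NTV design
    that is sized to spend the same power as the super-threshold baseline.

    Each NTV core is perf_cost times slower and uses 1/energy_gain of the
    energy per instruction, so it draws 1/(perf_cost * energy_gain) of the power.
    """
    cores = perf_cost * energy_gain              # cores that fit in the same power budget
    throughput = cores / perf_cost               # each core is perf_cost times slower
    power = cores / (perf_cost * energy_gain)    # relative to the baseline (1.0 = equal)
    return cores, throughput, power


if __name__ == "__main__":
    print("Power for 2x frequency via DVFS:", dvfs_power_ratio(2.0))
    area, throughput, power = ntv_tradeoff()
    print(f"NTV: {area:.0f}x area, {throughput:.0f}x throughput, {power:.0f}x power")
```

The 40× area figure is also the amount of extra parallelism the workload must supply, which is why the text calls it elusive.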
heterogeneity, because designs were largely measured according to a single axis: performance. To first order, there was a single optimal design point. Now that performance and energy are both important, a Pareto curve trades off performance and energy, and there is no one optimal design across that curve; there are many optimal points. Optimal designs will incorporate several such points across these curves.

These rules of thumb will guide our existing designs along an evolutionary path to become increasingly dark silicon friendly. But what then of more revolutionary approaches?

Insights from the brain: a dark technology

Perhaps one promising indicator that low-duty-cycle "dark technology" can be mastered, unlocking new application domains, is the efficiency and density of the human brain. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, the brain embodies an existence proof of highly parallel, reliable, and dark operation, and embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with extremely low duty cycles compared to processors: at best, 1 kilohertz. Although computing with silicon-simulated neurons introduces excessive "interpretive" overheads (neurons and transistors have fundamentally different properties), the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low duty cycles and low voltages called for by dark silicon. Here are some of these properties, which may give us insight on more revolutionary extensions to the evolutionary principles proposed in the last section:

Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and allow reconfiguration, evolving over time synaptic connections customized to the computation.

Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to arithmetic logic unit (ALU) transistors that toggle at three billion times per second. The most active neuron's activity is a millionth of that of processing transistors in today's processors.

Low-voltage operation. Brain cells operate at approximately 100 mV, yielding energy savings of 100× relative to 1-V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low swing and low voltage, saving large amounts of energy.

Limited sharing and memory multiplexing. Any given neuron can switch only 1,000 times per second, by definition, so it must have extremely limited sharing, because a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6M cones in the retina, similar to a 2-megapixel display, processes it with local neurons, and then sends it on the 1M-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has a set of its own ALUs, so to speak, so energy waste due to multiplexing is minimal.

Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If using 2 megapixels suffices to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome.

Analog operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; and on the output, despite producing rail-to-rail digital pulses, encode multiple bits of information via spike timings. Could this suggest that there are more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed
point of view. But the success of complex multiregime devices such as metal-oxide-semiconductor field-effect transistors (MOSFETs) has shown that engineers can tolerate complexity if the end result is better. Future chips are likely to employ not just one horseman, but all of them, in interesting and unique combinations.

The shrinking horseman

When confronted with the possibility of dark silicon, many chip designers insist that area is expensive, and that they would just build smaller chips instead of having dark silicon in their designs. Among the four horsemen, these "shrinking chips" are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's law provide small benefit. Below is an examination of the spectrum of second-order effects associated with shrinking chips.

Cost side of shrinking silicon. Understanding shrinking chips requires considering semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper; even if silicon begins as 50 percent of system cost, after a few process generations, it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon, which ultimately will eliminate incentives to move the design to the next process generation. These designs will be "left behind" on older generations.

Revenue side of shrinking silicon. Shrinking silicon can also shrink the chip selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit to the end product, then competition will force companies to do so. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitors will take the high end and enjoy high margins. Thus, in scenarios where dark silicon could be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. Hence, the shrinking-chips scenario is likely to happen only if we can find no practical use for dark silicon.

Power and packaging issues with shrinking chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of many-core thermal characteristics has shown that peak hotspot temperature rise can be modeled as $T_{\max} = \mathrm{TDP} \times (R_{\mathrm{conv}} + k/A)$, where $T_{\max}$ is the rise in temperature, TDP is the target chip thermal design power, $R_{\mathrm{conv}}$ is the heat sink thermal convection resistance (lower is a better heat sink), $k$ incorporates many-core design properties, and $A$ is chip area. If area drops exponentially, the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, and reduce scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold your frequency constant and increase cores by 1.4× with a net area decrease of 1.4× than it is to increase your frequency by 1.4× and shrink your chip by 2×.

The dim horseman

As exponentially larger fractions of a chip's transistors become dark transistors, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that spend area to buy energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we are led to some interesting new design possibilities.

The term dim silicon refers to techniques that put large amounts of otherwise-dark silicon area to productive use by employing heavy underclocking or infrequent use to meet the power budget; that is, the
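The thermal argument in the shrinking-horseman discussion (hotspot temperature rise modeled as TDP multiplied by the sum of a convection resistance and an area-dependent term) can be explored with a small sketch. All numeric values below are hypothetical placeholders chosen only to show the trend; they are not taken from the article or from the analysis it cites, and the function names are invented for this example.

```python
def peak_temp_rise(tdp_w, r_conv, k, area_mm2):
    """Hotspot temperature rise: TDP * (R_conv + k / A)."""
    return tdp_w * (r_conv + k / area_mm2)


def max_tdp(temp_limit, r_conv, k, area_mm2):
    """Largest TDP that keeps the temperature rise at or below temp_limit."""
    return temp_limit / (r_conv + k / area_mm2)


if __name__ == "__main__":
    # Hypothetical values: R_conv in K/W, k in K*mm^2/W, TDP in W.
    r_conv, k, tdp = 0.1, 10.0, 100.0
    for area in (400.0, 200.0, 100.0, 50.0):
        rise = peak_temp_rise(tdp, r_conv, k, area)
        budget = max_tdp(20.0, r_conv, k, area)
        print(f"area={area:5.0f} mm^2  rise={rise:5.1f} K  TDP for 20 K limit={budget:6.1f} W")
```

As the area halves, the k/A term grows until it dominates the fixed convection term, so the allowable TDP for a given temperature limit falls, which is the effect the text describes.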
beyond-CMOS devices (SONIC), engineering nonconventional atomic-scale engineered materials (FAME), and creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).

Evolutionary design principles for dark silicon

While researchers work to mature the new ideas represented by the four horsemen, what principles should guide today's designs that must tackle dark silicon? Listed below is a set of evolutionary, rather than revolutionary, dark silicon design principles that are motivated by changing trade-offs created by dark silicon:

Moving to the next generation will provide an automatic 1.4× energy-efficiency increase. Figure out how you will use it. As a baseline, chip capabilities will scale with energy, whether it is allocated to frequency or more cores. You can increase or decrease frequency or transistor counts, but transistors switched per unit time can increase by only 1.4×.

The next generation will create a large amount of dark area. Determine, for your domain, how to trade mostly dark area for energy. If the die area is fixed, any scaling is going to have a surplus of transistors. Which combination of the four horsemen is most effective in your domain? Should you go dim, with more caches? Underclocked arrays of cores? NTV on top of that? Add accelerators or c-cores? Use new kinds of devices? Shrink your chip?

Pipelining makes less sense than it used to. Figure out if faster transistor delays will allow you to fit more in a pipeline stage without reducing frequency. Pipelining increases duty cycle and introduces additional capacitance in circuits (registers, prediction circuits, bypassing, and clock tree fanout), neither of which is dark silicon friendly. Reducing pipeline depth and increasing FO4 depths reduces capacitive overhead. Note, too, that excessive pipelining and frequency exacerbate the gap between processing and memory.

Architectural multiplexing and logic sharing are becoming increasingly questionable optimizations. See if they still make sense. Sharing introduces additional energy consumption because it requires sharers to have longer wires to the shared logic, and it introduces additional performance and energy overheads from the control logic that manages the sharing. For example, architectures that have repositories of nonshared state that share physical pipelines (such as large-scale multithreading) pay large wire capacitances inside these memories to share that state. As area gets cheaper, it will make less sense to pay these overheads, and the degree of sharing will decrease so that the energy cost of pulling state out of these state repositories will be reduced.

Multiplexing and RAMs that facilitate sharing of program data are still a good idea. Keep them. If different threads of control are truly sharing data, multiplexed structures, such as shared RAM or crossbars, are often still more efficient than coherence protocols or other schemes.

Architectural techniques for saving transistors should only be applied if they do not worsen energy efficiency. Transistors are getting exponentially cheaper, and we can't use them all at once. Why are we trying to save transistors? Locally, transistor-saving optimizations make sense, but an exponential wind is blowing against these optimizations in the long run.

Power rails are the new clocks. Design with them in mind. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, all with their own clock gates. With dark silicon, we will see the same effect with power rails; we will have hundreds and maybe thousands of power rails in the future, all with their own power gates, to manage the leakage for the many heterogeneous system components.

Heterogeneity results from the shift from a 1D objective function (performance) to a 2D objective function (performance and energy). Design with the shape of this function in mind. The past lacked in
16. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. IEEE 17th Int'l Symp. High-Performance Computer Architecture (HPCA 11), IEEE CS, 2011, doi:10.1109/HPCA.2011.5749755.
17. S. Gupta et al., "Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM, 2011, pp. 12-23.
18. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS, 2012, pp. 449-460.
19. A. Ionescu et al., "Tunnel Field-Effect Transistors as Energy-Efficient Electronic Switches," Nature, 17 Nov. 2011, pp. 329-337.
20. F. Chen et al., "Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2010, pp. 150-151.

Michael B. Taylor is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego, where he leads the Center for Dark Silicon. His research interests include dark silicon, chip design, parallelization tools, and Bitcoin computing systems. Taylor has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Michael B. Taylor, 9500 Gilman Drive, MC 0404 EBU3B 3202, La Jolla, CA 92093-0404; mbtaylor@ucsd.edu.
solution to dark silicon; it is merely industry's initial, transitional response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [1, 3]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. This saved energy can then be applied to increase performance, or to have longer battery life or lower operating temperatures.

The utilization wall that causes dark silicon

Table 1 shows the derivation of the utilization wall that causes dark silicon. It employs a scaling factor, $S$, which is the ratio between the feature sizes of two processes (for example, $S = 32/22$, or about 1.4×, between 32 and 22 nm). In both Dennard and post-Dennard scaling, the transistor count scales by $S^2$, and the transistor switching frequency scales by $S$. Thus, our net increase in computing performance is $S^3$, or 2.8×.

However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by $S$, improving energy efficiency by $S$. In Dennard scaling, we can scale the threshold voltage and thus the operating voltage, which yields another $S^2$ energy-efficiency improvement. However, in today's post-Dennard, leakage-limited regime, we cannot scale threshold voltage without exponentially increasing leakage, and as a result, we must hold operating voltage roughly constant. The end result is a shortfall of $S^2$, or 2× per process generation. This shortfall multiplies with each process generation, resulting in exponentially darker silicon over time.

This shortfall prevents multicore from being the solution to scaling [1, 3]. Although advancing a single process generation would allow enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× total improvement. Per Figure 1, across two process generations ($S = 2$), designers could increase core count by 2× while leaving frequency constant, or they could increase frequency by 2× while leaving core count constant, or they could choose some middle ground between the two. The remaining 4× potential remains inaccessible.

More positively stated, the true new potential of Moore's law is a 1.4× energy-efficiency improvement per generation, which could be used to increase performance by 1.4×. Additionally, if we could somehow make use of dark silicon, we could do even better.

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future, and has proven remarkably accurate (see the sidebar "Is Dark Silicon Real? A Reality Check"). Follow-up work [6-8] has looked at extending this early work on dark silicon and multicore scaling with more sophisticated models that incorporate factors such as application space and cache size.

Dark silicon misconceptions

Let's clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it's just

Table 1. Dennard vs. post-Dennard (leakage-limited) scaling. In contrast to Dennard scaling, which held until 2005, under the post-Dennard regime, the total chip utilization for a fixed power budget drops by $S^2$ with each process generation. The result is an exponential increase in dark silicon for a fixed-sized chip under a fixed area budget.

Transistor property    Dennard    Post-Dennard
Quantity               S^2        S^2
Frequency              S          S
Capacitance            1/S        1/S
V_dd^2                 1/S^2      1
Power                  1          S^2
Utilization            1          1/S^2
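The utilization-wall derivation can be checked numerically with a minimal sketch. It assumes the first-order model described in the text (full-chip power is the product of transistor count, frequency, capacitance, and supply voltage squared); the function names and the default S of the square root of 2 per generation are choices made for this illustration.

```python
import math


def scaling_factors(s, dennard):
    """Per-generation scaling of full-chip power and utilization at fixed power."""
    quantity = s ** 2                              # transistor count
    frequency = s                                  # switching frequency
    capacitance = 1 / s                            # per-transistor capacitance
    vdd_squared = 1 / s ** 2 if dennard else 1.0   # voltage scales only under Dennard
    power = quantity * frequency * capacitance * vdd_squared
    return {"power": power, "utilization": 1 / power}


def dark_fraction(generations, s=math.sqrt(2)):
    """Fraction of a fixed-area chip that must stay dark after n post-Dennard generations."""
    utilization = scaling_factors(s, dennard=False)["utilization"]
    return 1 - utilization ** generations


if __name__ == "__main__":
    s = math.sqrt(2)
    print("Dennard power factor:     ", round(scaling_factors(s, True)["power"], 3))
    print("Post-Dennard power factor:", round(scaling_factors(s, False)["power"], 3))
    for n in (1, 2, 4):
        print(f"dark fraction after {n} generation(s): {dark_fraction(n):.2%}")
```

Under these assumptions, power stays flat in the Dennard case and doubles per generation in the post-Dennard case, so utilization at a fixed power budget halves each generation.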