slope but slow switching times. Both TFETs and NEMS devices thus hint at orders-of-magnitude improvements in leakage but remain untamed and fall short of being integrated into real chips.

Realizing the importance of the fourth horseman, a recent $194 million DARPA/MARCO STARnet program is funding four centers, each focusing on a key direction for beyond-CMOS approaches: developing electron spin-based memory computation devices (C-SPIN), formulating new information-processing models that can leverage statistical (that is, nondeterministic)

Figure 2. The GreenDroid architecture, an example of a coprocessor-dominated architecture (CoDA). The GreenDroid Mobile Application Processor comprises 16 nonidentical tiles (a). Each tile (b) holds components common to every tile (the CPU, on-chip network (OCN), and shared level-1 (L1) data cache) and provides space for multiple conservation cores, or c-cores, of various sizes. A variety of in-tile networks (c) connect components and c-cores.

increases in design, verification, and programming effort for these CoDAs. Combating the Tower of Babel problem requires defining a new paradigm for how specialization is expressed and exploited in future processing systems. We need new scalable architectural schemas that employ pervasively specialized hardware to minimize energy and maximize performance while at the same time insulating the hardware designer and programmer from such systems' underlying complexity.

Overcoming Amdahl-imposed limits on specialization. Amdahl's law provides an additional roadblock for specialization. To save energy across the majority of the computation, we must find broad-based specialization approaches that apply to both regular, parallel code and irregular code. We must also ensure that communicating specialized processors don't fritter away their energy savings on costly cross-chip communication or shared-memory accesses.

Recent efforts. The UCSD GreenDroid processor (see Figure 2) [3,15] is one such CoDA-based system that seeks to address both complexity issues and Amdahl limits. GreenDroid is a mobile application processor that implements Android mobile environment hotspots using hundreds of specialized cores (conservation cores, or c-cores) [1,9], which target both irregular and regular code, are automatically generated from C or C++ source code, and support a patching mechanism that lets them track software changes. They attain an estimated 8× to 10× energy-efficiency improvement, at no loss in serial performance, even on nonparallel code, and without any user or programmer intervention.

Unlike NTV processors, c-cores need not find additional parallelism in the workload to cover a serial performance loss. Thus, c-cores are likely to work across a wider range of workloads, including collections of serial programs. However, for highly parallel workloads in which execution time is loosely concentrated, NTV processors might hold an area advantage because of their reconfigurability.

Other specialized processors such as the University of Wisconsin-Madison's DySER and the University of Michigan's Beret propose alternative architectures that exploit specialization like c-cores, but focus on improving reconfigurability at the cost of energy savings. Recent efforts have also examined the use of approximate neural-network-based computing as an elegant way to package programmability, reconfigurability, and specialization.

The deus ex machina horseman
Of the four horsemen, this is by far the most unpredictable. Deus ex machina refers to a plot device in literature or theater in which the protagonists seem increasingly doomed until the very last moment, when something completely unexpected comes out of nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in semiconductor devices. However, as we shall see, the breakthroughs that would be required would have to be quite fundamental; in fact, we most likely would have to build transistors out of devices other than MOSFETs.

Why? Because MOSFET leakage is set by fundamental principles of device physics, and is limited to a subthreshold slope of 60 mV/decade at room temperature; this corresponds to a 10× reduction in leakage current for every 60 mV that the threshold voltage is above the Vss, which is determined by properties of thermionic emission of carriers across a potential well. Thus, although innovations such as Intel's FinFET/TriGate transistor and high-K dielectrics represent significant achievements in maintaining a subthreshold slope close to their historical values, they still remain within the scope of the MOSFET-imposed limits and are one-time improvements rather than scalable changes.

Two VLSI candidates that bypass these limits because they are not based on thermal injection are tunnel field-effect transistors (TFETs), which are based on tunneling effects, and nanoelectromechanical system (NEMS) switches, which are based on physical relays. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade (twice as good as the ideal MOSFET) but with lower on-currents than MOSFETs, limiting their use in high-performance circuits. NEMS devices have essentially a near-zero subthreshold

IEEE MICRO

subset of cache transistors (such as a wordline) is accessed each cycle, cache memories have low duty cycles and thus are inherently dark. Compared to general-purpose logic, a level-1 (L1) cache clocked at its maximum frequency can be about 10× darker per square millimeter, and larger caches can be even darker. Thus, adding cache is one way to simultaneously increase performance and lower power density per square millimeter. We can imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark silicon area: 1.4× to 2× more cache per core per generation. However, many applications do not benefit much from additional cache, and upcoming TSV-integrated DRAM will reduce the cache benefit for those applications that do.

Computational sprinting and Turbo Boost. Other techniques employ temporal dimness as opposed to spatial dimness, temporarily exceeding the nominal thermal budget but relying on thermal capacitance to buffer against temperature increases, and then ramping back to a comparatively dark state. Intel's Turbo Boost 2.0 uses this approach to boost performance up until the processor reaches nominal temperature, relying on the heat sink's innate thermal capacitance. ARM's big.LITTLE employs four A15 cores until the thermal envelope is exceeded (anecdotally, about 10 seconds), then switches over to four lower-energy, lower-performance A7 cores. Computational sprinting carries this a step further, employing phase-change materials that let chips exceed their sustainable thermal budget by an order of magnitude for several seconds, providing a short but substantial computational boost. These modes are especially useful for "race to finish" computations, such as webpage rendering, for which response latency is important, or for which speeding up the transition of both the processor and its support logic to a low-power state reduces energy consumption.

The specialized horseman
The specialized horseman uses dark silicon to implement a host of specialized coprocessors, each either much faster or much more energy efficient (100× to 1,000×) than a general-purpose processor. Execution hops between coprocessors and general-purpose cores, executing where it is most efficient. The unused cores are power- and clock-gated to keep them from consuming precious energy. Unlike dim silicon, which tends to focus on manipulating voltages, frequencies, and duty cycles as ways to manage power, specialized logic focuses on reducing the amount of capacitance that needs to be switched to perform a particular operation.

The promise for a future of widespread specialization is already being realized: we are seeing a proliferation of specialized accelerators that span diverse areas such as baseband processing, graphics, computer vision, and media coding. These accelerators enable orders-of-magnitude improvements in energy efficiency and performance, especially for computations that are highly parallel. Recent proposals have extrapolated this trend and anticipate that the near future will see systems comprising more coprocessors than general-purpose processors [1,7]. This article refers to these systems as coprocessor-dominated architectures, or CoDAs.

As specialization usage grows to combat the dark silicon problem, we are faced with a modern-day specialization "Tower of Babel" crisis that fragments our notion of general-purpose computation and eliminates the traditional clear lines of communication between programmers and software and the underlying hardware. Already, we see the deployment of specialized languages such as CUDA that are not usable between similar architectures (for example, AMD and Nvidia). We see overspecialization problems between accelerators that cause them to become inapplicable to closely related classes of computations (such as double-precision scientific codes running incorrectly on a GPU's non-IEEE-compliant floating-point hardware). Adoption problems are also caused by the excessive costs of programming heterogeneous hardware (such as the slow uptake of Sony PlayStation 3 versus Xbox). Moreover, specialized hardware risks obsolescence as standards are revised (for example, a JPEG standard revision).

Insulating humans from complexity. These factors speak to potential exponential

digitally. However, analog techniques might not scale well to deep nanometer technology.

Fast, static, gather, reduce, and broadcast operators. Neurons have fanout and fanin of approximately 7,000 to other neurons that are located significant distances away. Effectively, they can perform efficient operations that combine vector-style gather memory accesses to large numbers of static-memory locations, with a vector-style reduction operator and a broadcast. Do more efficient ways exist for implementing these operations in silicon? It could be useful for computations that operate on finite-sized static graphs.

Recently, both the EU and US governments have proposed initiatives to enable greater studies of the computational capabilities of the brain. Although brain-inspired computing has already come and gone several times in the brief history of manmade computers, dark silicon may cause these approaches to become increasingly relevant.

Although silicon is getting darker, for researchers the future is bright and exciting. Dark silicon will cause a transformation of the computational stack and provide many opportunities for investigation.

Acknowledgments
This work was partially supported by NSF awards 0846152, 1018850, 0811794, and 1228992, Nokia and AMD gifts, and by STARnet, an SRC program sponsored by MARCO and DARPA. I thank the anonymous reviewers for their valuable insights and suggestions.

References
1. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Architectural Support for Programming Languages and Operating Systems Conf., ACM, 2010, pp. 205-218.
2. R. Merrit, "ARM CTO: Power Surge Could Create Dark Silicon," EE Times, 22 Oct.
3. N. Goulding et al., "GreenDroid: A Mobile Application Processor for a Future of Dark Silicon," Hot Chips Symp., 2010.
4. M. Taylor, "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," Proc. 49th Ann. Design Automation Conf. (DAC 12), ACM, 2012, pp. 1131-1136.
5. R.H. Dennard, "Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. 256-268.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ACM SIGARCH Computer Architecture News, vol. 39, no. 3, 2011, pp. 365-376.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. W. Huang et al., "Scaling with Design Constraints: Predicting the Future of Big Chips," IEEE Micro, vol. 31, no. 4, 2011, pp. 16-29.
9. J. Sampson et al., "Efficient Complex Operators for Irregular Codes," Proc. 17th Int'l Symp. High Performance Computer Architecture (HPCA 11), IEEE CS, 2011, pp. 491-502.
10. A. Raghavan et al., "Computational Sprinting," Proc. IEEE 18th Int'l Symp. High-Performance Computer Architecture (HPCA 12), IEEE CS, 2012, doi:10.1109/HPCA.2012.6169031.
11. R. Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," Proc. IEEE, vol. 98, no. 2, 2010, pp. 253-266.
12. E. Krimer et al., "Synctium: A Near-Threshold Stream Processor for Energy-Constrained Parallel Applications," IEEE Computer Architecture Letters, Jan. 2010, pp. 21-24.
13. D. Fick et al., "Centip3de: A 3930 DMIPS/W Configurable Near-Threshold 3D Stacked System with 64 ARM Cortex-M3 Cores," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 190-192.
14. S. Jain et al., "A 280 mV-to-1.2 V Wide-Operating-Range IA-32 Processor in 32 nm CMOS," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2012, pp. 66-68.
15. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86-95.
silicon that is not used all the time, or at its full frequency. Even during the best days of CMOS scaling, microprocessor and other circuits were chock full of "dark logic" used infrequently or for only some applications; for instance, caches are inherently dark because the average cache transistor is switched for far less than one percent of cycles, and FPUs remain dark in integer codes.

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits toward swaths of low-duty cycle logic that exists, not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement because it frees up more of the fixed power budget to be used for even more computation.

The four horsemen
Recently, researchers proposed a taxonomy (the four horsemen) that identifies four directions for dealing with dark silicon that have emerged as promising approaches as we transition beyond the initial multicore stop-gap solution. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering

[Figure 1 panels: 65 nm, 4 cores at 1.8 GHz; 32 nm, either 2 × 4 cores at 1.8 GHz (8 cores dark, 8 dim; industry's choice) or 4 cores at 2 × 1.8 GHz (12 cores dark). Spectrum of trade-offs between number of cores and frequency.]

Figure 1. Multicore scaling leads to large amounts of dark silicon. Across two process generations, there is a spectrum of trade-offs between frequency and core count; these include increasing core count by 2× but leaving frequency constant (top), and increasing frequency by 2× but leaving core count constant (bottom). Any of these trade-off points will have large amounts of dark silicon.
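The two endpoints of the Figure 1 trade-off spectrum can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the article's first-order post-Dennard model (area allows S² more cores, but the power budget lets cores × frequency grow only by S), takes the 65-nm baseline of 4 cores at 1.8 GHz from the figure, and uses a hypothetical function name, figure1_endpoints.

```python
def figure1_endpoints(base_cores=4, base_ghz=1.8, s=2.0):
    """Return the two extreme design points for a scaling factor s.

    Assumption (first-order post-Dennard model): the same die area holds
    base_cores * s**2 cores, but the fixed power budget only supports
    base_cores * base_ghz * s core-GHz of switching activity.
    """
    cores_that_fit = round(base_cores * s ** 2)   # cores the area could hold
    budget = base_cores * base_ghz * s            # sustainable core-GHz

    # Endpoint 1: spend the whole budget on more cores at the old frequency.
    more_cores = {"active_cores": round(base_cores * s), "ghz": base_ghz}
    # Endpoint 2: keep the old core count and spend the budget on frequency.
    faster = {"active_cores": base_cores, "ghz": base_ghz * s}

    for point in (more_cores, faster):
        point["dark_cores"] = cores_that_fit - point["active_cores"]
        point["core_ghz"] = point["active_cores"] * point["ghz"]
    return cores_that_fit, budget, more_cores, faster


cores_that_fit, budget, more_cores, faster = figure1_endpoints()
print(cores_that_fit, budget)
print(more_cores)
print(faster)
```

Both endpoints consume the same core-GHz budget, which is the sense in which they lie on one spectrum; any point in between leaves a comparable amount of silicon dark or dim.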
Is Dark Silicon Real? A Reality Check
A quick survey of recent designs from multicore outfits such as Tilera, Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz, and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has six superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74 percent versus the 93.75 percent predicted by the utilization wall. The latest 2012 International Technology Roadmap for Semiconductors also shows that scaling has proceeded consistently with post-Dennard predictions.
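The sidebar's figures can be reproduced approximately with the short sketch below. It is a hedged reconstruction, not the author's calculation: the function names are hypothetical, and two readings are assumed. First, the 69× figure appears to come from the rounded values (15.5 GHz × 17 cores / 3.8 GHz), since the unrounded S³ is closer to 68.5. Second, the 93.75 percent prediction is read as 1 - 1/2⁴, that is, four process generations (90, 65, 45, 32, 22 nm) with a 2× shortfall each.

```python
def dennard_projection(old_nm, new_nm, old_ghz, old_cores=1):
    """Project frequency and core count under ideal Dennard scaling."""
    s = old_nm / new_nm                      # linear scaling factor
    return {
        "s": s,
        "ghz": old_ghz * s,                  # frequency scales by s
        "cores": old_cores * s ** 2,         # transistor count scales by s**2
    }


def darkness_ratio(actual_gain, ideal_gain):
    """Fraction of the ideal throughput gain that was not realized."""
    return 1.0 - actual_gain / ideal_gain


proj = dennard_projection(old_nm=90, new_nm=22, old_ghz=3.8)

# Actual part: six cores at 3.6 GHz versus one core at 3.8 GHz.
actual_gain = (3.6 * 6) / 3.8

# Darkness ratio using the rounded figures quoted in the sidebar.
quoted = darkness_ratio(5.7, 69)

# Utilization-wall prediction, assuming four generations at 2x shortfall each.
predicted = 1.0 - 0.5 ** 4

print(round(proj["ghz"], 1), round(proj["cores"]))
print(round(actual_gain, 1))
print(round(100 * quoted, 2), round(100 * predicted, 2))
```

Using unrounded inputs shifts the darkness ratio by a few hundredths of a percentage point, which does not change the sidebar's conclusion.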
architecture is strategically managing the chip-wide transistor duty cycle to enforce the overall power constraint [8,9]. Whereas early 90-nm designs such as Cell and Prescott were dimmed because actual power exceeded design-estimated power, we are converging on increasingly more elegant methods that make better trade-offs.

Dim silicon techniques include dynamically varying the frequency with the number of cores being used, scaling up the amount of cache logic, employing near-threshold voltage (NTV) processor designs, and redesigning the architecture to accommodate bursts that temporarily allow the power budget to be exceeded, such as Turbo Boost and computational sprinting.

Turbo Boost 1.0. Although first-generation multicores had a ship-time-determined top frequency that was invariant of the number of currently active cores, Intel's Turbo Boost 1.0 enabled second-generation multicores to make real-time trade-offs between active core count and the frequency the cores ran at: the fewer the cores, the higher the frequency. When Turbo Boost is enabled, it uses the energy gained from turning off cores to increase the voltage and then the frequency of the active cores. This technique, known as dynamic voltage and frequency scaling (DVFS), increases power proportional to the cube of the increase in frequency.

NTV processors. In the past, DVFS was also used to save cubic power when frequencies were decreased. However, today, processor manufacturers operate transistors at reduced voltages around 2.5× the threshold voltage, an energy-delay optimal point. This point is right at the edge of an operating regime where frequency starts to drop precipitously as voltage is reduced, which makes downward-DVFS much less effective.

Nonetheless, researchers have begun to explore this regime. One recent approach is Near-Threshold Voltage (NTV) logic, which operates transistors in the near-threshold regime slightly above the threshold voltage, providing more palatable trade-offs between energy and delay than subthreshold circuits, for which frequency drops exponentially with voltage decreases. Researchers have explored wide-SIMD NTV processors, which seek to exploit data parallelism, along with NTV many-core processors and an NTV x86 processor.

Although NTV per-processor performance drops faster than the corresponding savings in energy-per-instruction (5× energy improvement for an 8× performance cost), the performance loss can be offset by using 8× more processors in parallel if the workload allows it. Then, an additional 5× processors could turn the energy efficiency gains into additional performance. So, with ideal parallelization, NTV could offer 5× the throughput improvement by absorbing 40× the area. But this would also require 40× more free parallelism in the workload relative to the parallelism consumed by an equivalent energy-limited super-threshold many-core processor.

In practice, for many applications, 40× additional parallelism can be elusive. For chips with large power budgets that can already sustain hundreds of cores, applications that have this much spare parallelism are relatively rare. Interestingly, because of this effect, NTV's applicability across applications increases in low-energy environments because the energy-limited baseline super-threshold design has consumed less of the available parallelism. Furthermore, NTV clearly becomes more applicable for workloads with extremely large amounts of parallelism.

NTV presents several circuit-related challenges that have seen active investigation, especially because technology scaling will exacerbate rather than ameliorate these factors. A significant NTV challenge has been susceptibility to process variability. As operating voltages drop, variation in transistor threshold due to random dopant fluctuation is proportionally higher, and leakage and operating frequency can vary greatly. Because NTV designs can expand the area consumption by approximately 8× or more, variation issues are exacerbated. Other challenges include the penalties involved in designing low-operating voltage static RAMs (SRAMs) and the increased interconnection energy consumption due to greater spreading across cores.

Bigger caches. An often-proposed dim-silicon alternative is to simply allocate otherwise dark silicon area for caches. Because only a
heterogeneity, because designs were largely measured according to a single axis: performance. To first order, there was a single optimal design point. Now that performance and energy are both important, a Pareto curve trades off performance and energy, and there is no one optimal design across that curve; there are many optimal points. Optimal designs will incorporate several such points across these curves.

These rules of thumb will guide our existing designs along an evolutionary path to become increasingly dark silicon friendly, but what then of more revolutionary approaches?

Insights from the brain: a dark technology
Perhaps one promising indicator that low-duty cycle, "dark technology" can be mastered, unlocking new application domains, is the efficiency and density of the human brain. The brain, even today, can perform many tasks that computers cannot, especially vision-related tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, the brain embodies an existence proof of highly parallel, reliable, and dark operation, and embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with extremely low-duty cycles compared to processors: at best, 1 kilohertz. Although computing with silicon-simulated neurons introduces excessive "interpretive" overheads (neurons and transistors have fundamentally different properties), the brain can offer us insight and long-term ideas about how we can redesign systems for the extremely low-duty cycles and low voltages called for by dark silicon. Here are some of these properties, which may give us insight on more revolutionary extensions to the evolutionary principles proposed in the last section:

Specialization. As with the specialized horseman, different groups of neurons serve different functions in cognitive processing, connect to different sensory organs, and allow reconfiguration, evolving with time synaptic connections customized to the computation.

Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per second. Compare this to arithmetic logic unit (ALU) transistors that toggle at three billion times per second. The most active neuron's activity is a millionth of that of processing transistors in today's processors.

Low-voltage operation. Brain cells operate at approximately 100 mV, yielding energy savings of 100× versus 1-V operation, in a clear parallel to the dim horseman's NTV circuits. Communication is low swing and low voltage, saving large amounts of energy.

Limited sharing and memory multiplexing. Any given neuron can switch only 1,000 times per second, by definition, so it must have extremely limited sharing, because a point of multiplexing would be a bottleneck in parallel processing. The human visual system starts with 6M cones in the retina, similar to a 2-megapixel display, processes it with local neurons, and then sends it on the 1M-neuron optic nerve to the visual cortex. There is no central memory store; each pixel has a set of its own ALUs, so to speak, so energy waste due to multiplexing is minimal.

Data decimation. The human brain reduces the data size at each step and operates on concise but approximate representations. If using 2 megapixels suffices to handle color-related vision tasks, why use more than that? Larger sensors would just require more neurons to store and compute on the data. We should ensure that we are processing no more data than necessary to achieve the final outcome.

Analog operation. The neuron performs a more complex basic operation than the typical digital transistor. On the input side, neurons combine information from many other neurons; and on the output, despite producing rail-to-rail digital pulses, encode multiple bits of information via spike timings. Could this suggest that there are more efficient ways to map operations onto silicon-based technologies? In RF wireless front-end communications, analog processing enables computations that would be impossible to do at speed
point of view. But the success of complex multiregime devices such as metal-oxide-semiconductor field-effect transistors (MOSFETs) has shown that engineers can tolerate complexity if the end result is better. Future chips are likely to employ not just one horseman, but all of them, in interesting and unique combinations.

The shrinking horseman
When confronted with the possibility of dark silicon, many chip designers insist that area is expensive, and that they would just build smaller chips instead of having dark silicon in their designs. Among the four horsemen, these "shrinking chips" are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's law provide small benefit. Below is an examination of the spectrum of second-order effects associated with shrinking chips.

Cost side of shrinking silicon. Understanding shrinking chips requires considering semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper; even if silicon begins as 50 percent of system cost, after a few process generations, it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon, which ultimately will eliminate incentives to move the design to the next process generation. These designs will be left behind on older generations.

Revenue side of shrinking silicon. Shrinking silicon can also shrink the chip selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit to the end product, then competition will force companies to do so. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitor will take the high end and enjoy high margins. Thus, in scenarios where dark silicon could be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price.
Hence, the shrinking-chips scenario is likely to happen only if we can find no practical use for dark silicon.

Power and packaging issues with shrinking chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of many-core thermal characteristics has shown that peak hotspot temperature rise can be modeled as T_max = TDP × (R_conv + k/A), where T_max is the rise in temperature, TDP is the target chip thermal design power, R_conv is the heat sink thermal convection resistance (lower is a better heat sink), k incorporates many-core design properties, and A is chip area. If area drops exponentially, the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, and reduce scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold your frequency constant and increase cores by 1.4× with a net area decrease of 1.4× than it is to increase your frequency by 1.4× and shrink your chip by 2×.

The dim horseman
As exponentially larger fractions of a chip's transistors become dark transistors, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we are led to some interesting new design possibilities.

The term dim silicon refers to techniques that put large amounts of otherwise-dark silicon area to productive use by employing heavy underclocking or infrequent use to meet the power budget; that is, the
beyond-CMOS devices (SONIC), engineering nonconventional atomic scale engineered materials (FAME), and creating new devices that extend prior work on TFETs to operate at even lower voltages (LEAST).

Evolutionary design principles for dark silicon
While researchers work to mature the new ideas represented by the four horsemen, what principles should guide today's designs that must tackle dark silicon? Listed below is a set of evolutionary, rather than revolutionary, dark silicon design principles that are motivated by changing trade-offs created by dark silicon:

Moving to the next generation will provide an automatic 1.4× energy-efficiency increase. Figure out how you will use it. As a baseline, chip capabilities will scale with energy, whether it is allocated to frequency or more cores. You can increase or decrease frequency or transistor counts, but transistors switched per unit time can increase by only 1.4×.

The next generation will create a large amount of dark area. Determine, for your domain, how to trade mostly dark area for energy. If the die area is fixed, any scaling is going to have a surplus of transistors. Which combination of the four horsemen is most effective in your domain? Should you go dim: more caches? Underclocked arrays of cores? NTV on top of that? Add accelerators or c-cores? Use new kinds of devices? Shrink your chip?

Pipelining makes less sense than it used to. Figure out if faster transistor delays will allow you to fit more in a pipeline stage without reducing frequency. Pipelining increases duty cycle and introduces additional capacitance in circuits (registers, prediction circuits, bypassing, and clock tree fanout), neither of which is dark silicon friendly. Reducing pipeline depth and increasing FO4 depths reduces capacitive overhead. Note, too, that excessive pipelining and frequency exacerbate the gap between processing and memory.

Architectural multiplexing and logic sharing are becoming increasingly questionable optimizations. See if they still make sense. Sharing introduces additional energy consumption because it requires sharers to have longer wires to the shared logic, and it introduces additional performance and energy overheads from the control logic that manages the sharing. For example, architectures that have repositories of nonshared state that share physical pipelines (such as large-scale multithreading) pay large wire capacitances inside these memories to share that state. As area gets cheaper, it will make less sense to pay these overheads, and the degree of sharing will decrease so that the energy cost of pulling state out of these state repositories will be reduced.

Multiplexing and RAMs that facilitate sharing of program data are still a good idea. Keep them. If different threads of control are truly sharing data, multiplexed structures, such as shared RAM, or crossbars, are often still more efficient than coherence protocols or other schemes.

Architectural techniques for saving transistors should only be applied if they do not worsen energy efficiency. Transistors are getting exponentially cheaper, and we can't use them all at once. Why are we trying to save transistors? Locally, transistor-saving optimizations make sense, but an exponential wind is blowing against these optimizations in the long run.

Power rails are the new clocks. Design with them in mind. Ten years ago, it was a big step to move beyond a few clock domains. Now, chips can have hundreds of clock domains, all with their own clock gates. With dark silicon, we will see the same effect with power rails; we will have hundreds and maybe thousands of power rails in the future, all with their own power gates, to manage the leakage for the many heterogeneous system components.

Heterogeneity results from the shift from a 1D objective function (performance) to a 2D objective function (performance and energy). Design with the shape of this function in mind. The past lacked in
16. V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized Datapaths for Energy Efficient Computing," Proc. IEEE 17th Int'l Symp. High-Performance Computer Architecture (HPCA 11), IEEE CS, 2011, doi:10.1109/HPCA.2011.5749755.
17. S. Gupta et al., "Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing," Proc. 44th Ann. IEEE/ACM Int'l Symp. Microarchitecture, ACM, 2011, pp. 12-23.
18. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE CS, 2012, pp. 449-460.
19. A. Ionescu et al., "Tunnel Field-Effect Transistors as Energy-Efficient Electronic Switches," Nature, 17 Nov. 2011, pp. 329-337.
20. F. Chen et al., "Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE, 2010, pp. 150-151.

Michael B. Taylor is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego, where he leads the Center for Dark Silicon. His research interests include dark silicon, chip design, parallelization tools, and Bitcoin computing systems. Taylor has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Michael B. Taylor, 9500 Gilman Drive, MC 0404 EBU3B 3202, La Jolla, CA 92093-0404; mbtaylor@ucsd.edu.
solution to dark silicon; it is merely industry's initial, transitional response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [1,3]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. This saved energy can then be applied to increase performance, or to have longer battery life or lower operating temperatures.

The utilization wall that causes dark silicon
Table 1 shows the derivation of the utilization wall that causes dark silicon. It employs a scaling factor, S, which is the ratio between the feature sizes of two processes (for example, between 32 and 22 nm). In both Dennard and post-Dennard scaling, the transistor count scales by S², and the transistor switching frequency scales by S. Thus, our net increase in computing performance is S³, or 2.8×. However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by S, improving energy efficiency by S. In Dennard scaling, we can scale the threshold voltage and thus the operating voltage, which yields another S² energy-efficiency improvement. However, in today's post-Dennard, leakage-limited regime, we cannot scale threshold voltage without exponentially increasing leakage, and as a result, we must hold operating voltage roughly constant. The end result is a shortfall of S², or 2× per process generation. This shortfall multiplies with each process generation, resulting in exponentially darker silicon over time.

This shortfall prevents multicore from being the solution to scaling [1,3]. Although advancing a single process generation would allow enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× total improvement. Per Figure 1, across two process generations (S = 2), designers could increase core count by 2× leaving frequency constant, or they could increase frequency by 2× leaving core count constant, or they could choose some middle ground between the two. The remaining 4× potential remains inaccessible.

More positively stated, the true new potential of Moore's law is a 1.4× energy-efficiency improvement per generation, which could be used to increase performance by 1.4×. Additionally, if we could somehow make use of dark silicon, we could do even better.

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future, and has proven remarkably accurate (see the sidebar "Is Dark Silicon Real? A Reality Check"). Follow-up work [6-8] has looked at extending this early work on dark silicon and multicore scaling with more sophisticated models that incorporate factors such as application space and cache size.

Dark silicon misconceptions
Let's clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it's just

Table 1. Dennard vs. post-Dennard (leakage-limited) scaling. In contrast to Dennard scaling, which held until 2005, under the post-Dennard regime, the total chip utilization for a fixed power budget drops by S² with each process generation. The result is an exponential increase in dark silicon for a fixed-sized chip under a fixed area budget.

Transistor property       Dennard    Post-Dennard
Quantity                  S²         S²
Frequency                 S          S
Capacitance               1/S        1/S
Vdd²                      1/S²       1
Power = Q × F × C × Vdd²  1          S²
Utilization = 1/Power     1          1/S²
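The derivation summarized in Table 1 can be worked through numerically. The following is a minimal sketch under stated assumptions: S is taken as the square root of 2 (about 1.4) per generation, power is modeled as quantity × frequency × capacitance × Vdd², and the function names scale_one_generation and dark_fraction are hypothetical helpers introduced for illustration.

```python
import math


def scale_one_generation(s, dennard):
    """Relative change in full-chip power and utilization for one generation."""
    quantity = s ** 2                 # transistor count grows by s**2
    frequency = s                     # switching frequency grows by s
    capacitance = 1 / s               # per-transistor capacitance shrinks by s
    # Dennard: operating voltage scales down by s; post-Dennard: held constant.
    vdd_squared = 1 / s ** 2 if dennard else 1.0
    power = quantity * frequency * capacitance * vdd_squared
    # Under a fixed power budget, only 1/power of the chip can run at full speed.
    return {"power": power, "utilization": 1 / power}


def dark_fraction(s, generations):
    """Fraction of a fixed-size chip that is dark after several generations."""
    utilization = scale_one_generation(s, dennard=False)["utilization"]
    return 1 - utilization ** generations


S = math.sqrt(2)                      # assumed per-generation scaling factor
dennard = scale_one_generation(S, dennard=True)
post_dennard = scale_one_generation(S, dennard=False)

print(dennard)
print(post_dennard)
print(dark_fraction(S, 4))
```

Under these assumptions, Dennard scaling keeps full-chip power at roughly 1×, while post-Dennard scaling roughly doubles it, so utilization halves each generation and the dark fraction compounds.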