Coprocessors and Attached Processors

tawny-fly . @tawny-fly
Uploaded On 2017-03-26


Presentation Transcript

Coprocessors and Attached Processors

This lecture is based mostly on material from Tanenbaum's textbook Structured Computer Organization (Ref. 4). We shall begin with a refresher on VLIW (Very Long Instruction Word) designs and then examine a number of coprocessors, several of which are VLIW.

Topics:
1. The VLIW design and its use in single processors.
2. The TriMedia VLIW CPU.
3. Heterogeneous multiprocessors on a chip: the DVD player.
4. The Global Internet, Ethernet, and Attached Network Cards.
5. The Nexperia Media Coprocessor.
6. Other high-end video graphics cards.
7. High-end coprocessors for audio production.
8. Cryptoprocessors.

As we shall see, the economics of the mass market often favor the production of highly specialized attached processors to share the computing load with the CPU.

The Very Long Instruction Word Design

The VLIW design is one that we first encountered when discussing high-performance single processor computing systems. The design assumed a superscalar CPU, and called for machine code words with multiple instructions, one per CPU function unit.

Each machine code word might have two integer instructions, one floating point instruction, and so forth. Modern designs issue bundles with an end-of-bundle mark.

The TriMedia VLIW Central Processing Unit

The TriMedia processor was designed by Philips, the Dutch electronics company that also designed the CD and CD-ROM (Ref. 4). It is designed for media-intensive applications, such as image processing, CD and DVD recorders or players, digital video cameras, digital television sets, etc. The TriMedia is a true VLIW processor.

Each machine language instruction commonly specifies five operations. The machine word is divided into five slots, one per operation to be issued. Each slot commands one or more function units, so that some slots are "special purpose".

Here is the format of a typical TriMedia machine instruction.

[Figure: format of a typical TriMedia machine instruction]

The TM3260 implementation runs at 250 MHz. Since it can issue five operations per clock cycle, it has an effective maximum rating of 250 × 5 = 1250 MIPS.

The TriMedia has a byte-oriented memory. It uses memory-mapped I/O, in which each I/O device is accessed through registers mapped into the memory address space.
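
The idea of memory-mapped I/O described above can be illustrated with a small toy model in Python. This is only a sketch: the address ranges, the class names (ToyDevice, AddressSpace), and the two-register device are assumptions made for illustration and are not taken from the TriMedia documentation. The point is that the same load and store operations reach either ordinary memory or a device register, depending only on the address.

```python
# Toy model of memory-mapped I/O. All addresses and names are illustrative
# assumptions; they do not describe the real TriMedia memory map.

RAM_SIZE = 0x1000       # bytes of ordinary memory (assumed)
DEVICE_BASE = 0x1000    # first address claimed by the device (assumed)


class ToyDevice:
    """A device with two one-byte registers: data (offset 0), status (offset 1)."""

    def __init__(self):
        self.written = []   # bytes the "CPU" has sent to the data register

    def load(self, offset):
        if offset == 1:
            return 1        # status register: always "ready" in this toy
        return 0            # the data register reads as zero

    def store(self, offset, value):
        if offset == 0:
            self.written.append(value & 0xFF)


class AddressSpace:
    """Routes each byte-sized load/store to RAM or to the device by address."""

    def __init__(self, device):
        self.ram = bytearray(RAM_SIZE)
        self.device = device

    def load(self, addr):
        if addr < RAM_SIZE:
            return self.ram[addr]
        return self.device.load(addr - DEVICE_BASE)

    def store(self, addr, value):
        if addr < RAM_SIZE:
            self.ram[addr] = value & 0xFF
        else:
            self.device.store(addr - DEVICE_BASE, value)


device = ToyDevice()
bus = AddressSpace(device)
bus.store(0x0010, 0x41)         # an ordinary memory write
bus.store(DEVICE_BASE, 0x42)    # the same operation, but it reaches the device
```

No special I/O instructions appear in the sketch; the device is driven entirely by stores and loads to its assigned addresses, which is the design choice that memory-mapped I/O makes.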

The TriMedia Processors

Here is a table taken from the Wikipedia article on the history of TriMedia processors (Ref. 5).

Core      Year (1st silicon)  ISA         Cache (I/D) KB          Frequency (worst case)  Technology  Features
TM1000    1997                TMA0        32/16                   100 MHz                 500 nm
TM1100    1998                TMA1        32/16                   133 MHz                 350 nm
TM1300    1999                TMA1        32/16                   166 MHz                 250 nm
TM3260    2002                TMA2        64/16                   250 MHz                 130 nm      binary compatible with TM1300
TM5250    2004                TMA3        64/16                   450 MHz                 130 nm      128 KB L2 data cache, allocate on write miss, hardware prefetching, super pipelined (high speed)
TM2270    2006                TMA3        32/16                   290 MHz                 90 nm       96 GPRs (small area)
TM3270/1  2006                TMA4 + ASE  64/128, 64/32, 32/16    350 MHz                 90 nm       low power

The Tanenbaum textbook is based on the TM3260. Note the successor processors.
1. The TM5250, operating at 450 MHz. It is more powerful.
2. The TM2270 and TM3270, designed to be small and/or low in power consumption.

The two common market pressures are high performance and low power usage.

The TriMedia CPU: Details

The CPU has 128 general purpose registers, each holding a 32-bit number. Two of the registers store constant values: R0 stores 0 and R1 stores 1. All others are general purpose and can store integers (8, 16, or 32 bits) or IEEE-754 floating point values.

The TM3260 has 12 functional units: a control unit and eleven units for arithmetic, logical, and control-flow operations. Some of these units respond only to instructions in specific instruction slots; others can be commanded from any instruction slot.

[Table: TM3260 functional units, their latencies, and the instruction slots that can command them]

The latency is the number of steps to move a result through the functional unit. The last five columns show the placement of commands for each functional unit.

The TriMedia CPU: Mathematical Units

The standard arithmetic units use the two's-complement standard for integer arithmetic, but the DSP (Digital Signal Processor) units use saturation arithmetic.

In saturation arithmetic, an operation whose result is not representable due to overflow saturates at the maximum value rather than generating an exception.

For example, the range of numbers representable by 8-bit unsigned integer arithmetic is 0 through 255 inclusive. In saturation arithmetic, 180 + 180 = 255, the maximum value.

With two minor exceptions, all operations in the TriMedia are predicated.
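
The saturation arithmetic used by the DSP units, described above, can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this illustration; the sketch mirrors the 8-bit unsigned example from the text and contrasts it with ordinary wrap-around (modular) addition. It is not a model of the actual TriMedia DSP instructions.

```python
def saturating_add(a, b, lo, hi):
    """Add two integers, clamping the result to the range [lo, hi]."""
    return max(lo, min(hi, a + b))


def saturating_add_u8(a, b):
    """8-bit unsigned saturating add: results are clamped to 0..255."""
    return saturating_add(a, b, 0, 255)


def wrapping_add_u8(a, b):
    """8-bit unsigned wrap-around add, shown for contrast (modulo 256)."""
    return (a + b) % 256


print(saturating_add_u8(180, 180))   # 255, as in the example above
print(wrapping_add_u8(180, 180))     # 104, since 360 - 256 = 104
```

Saturation suits audio and video samples: an overflowing sum becomes the brightest or loudest representable value rather than wrapping around to a small one.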

In a predicated instruction, each operation specifies a register that is to be tested before the operation is executed. The low-order bit of the register is examined.
    If that bit is 0, the operation is skipped.
    If that bit is 1, the operation is executed.

    IF R2 IADD R4, R5 -> R8    // Add R4 to R5 and place result into R8.
                               // But only if bit 0 of R2 is a 1; otherwise do nothing.

Using R1 as the predicate register makes the operation unconditional, as R1 = 1.
Using R0 as the predicate register makes it a no-op, as R0 = 0.

Heterogeneous Processor Example: The DVD Player

The computer controlling the DVD player has a number of very different functions. Each of these is assigned to a specialized processor.

This design uses multiple cores on a single large chip. A core is a large circuit, such as a CPU, I/O controller, or cache, that can be placed on a chip in a modular way. Some modern processors are dual-core in that they have two cores, each being a full CPU.

This design might be called "heterogeneous multi-core". Each of the closely-coupled cores has a dedicated function related to the format of the data it must process. This design was found to be more economical than a single general-purpose CPU.

Computers From "Piece Parts"

We now face the issue of how to design computers and their major components.

Main components, such as the CPU, will continue to be designed from basic gates in the traditional way for some time. Here the advantage in performance gained from a single integrated design justifies the cost and effort involved.

We now have another attractive option for the design of computing machines. This one is made attractive by the availability of a variety of cores, each with a dedicated function. This collection of cores can be considered essentially as a set of libraries of functions, except that these functions are implemented in hardware.

IBM has produced a design, called CoreConnect, which is an architecture for connecting cores on a single-chip heterogeneous multiprocessor. Here is an example.

[Figure: example CoreConnect design]

Note the two busses; one is faster than the other.

The Global Internet and the Network Interface Card (NIC)

You may think that your computer is connected to the Internet, but it is not. The computer is connected to a NIC; it is that NIC that is connected to the Internet.

The NIC is a dedicated I/O coprocessor, which communicates with the computer's CPU via interrupts and DMA (Direct Memory Access). Except when the NIC is operated in "promiscuous mode" (for network snooping), it filters all packets by MAC address.

The standard of transmission that we shall discuss is called "Ethernet". Packets in this protocol possess two 48-bit MAC (Media Access Control) addresses, one for the source NIC and one for the destination NIC.

Here is the format of an Ethernet packet containing an IP packet.

[Figure: format of an Ethernet packet containing an IP packet]

The Ethernet header contains the two MAC addresses. Each NIC has a unique MAC address assigned to it under a protocol administered by the IEEE. In normal use, the NIC will recognize messages sent to its MAC address and pass only those to the CPU.

The NIC (Network Processor)

The NIC is a programmable device that can handle incoming and outgoing packets at the full network speed. It is plugged into a standard slot in the computer motherboard.

One or more network lines connect to the board and are routed to the network processor. Most setups have only a single network line attached, but computers used as switches, routers, and the like must have at least two network lines attached.

Here is a diagram of a typical network processor, using a PCI slot on the motherboard.

[Figure: typical network processor attached through a PCI slot]

Note the multiple PPEs (Packet Processing Engines). Each is a specialized core with a dedicated task; the set forms a packet processing pipeline.

The Nexperia Media Processor

Ordinary general-purpose processors are not especially good at the massively parallel computations required to process high-resolution audio and video streams.

The Nexperia is a single-chip heterogeneous multiprocessor designed by Philips, using its TriMedia chip. It comprises a heterogeneous collection of cores, each with a dedicated function for which it has been optimized. Here is the PNX1500.

[Figure: block diagram of the PNX1500]

More on the Nexperia

The Nexperia is designed for use either as a coprocessor in a PC or as a stand-alone main processor in an appliance such as a DVD player, digital TV set, video camera, etc.

Other than the SRAM and SDRAM internal to the TriMedia processor, the Nexperia contains no main memory on the chip. The PNX1500 implementation has an interface to external memory, allowing for 8 to 256 MB of DDR SDRAM.

The width of the memory interface is 32 bits (4 bytes). Because DDR memory transfers data on both edges of the clock, this allows the memory to transfer 8 bytes per clock pulse; at 200 MHz the data rate is 1.6 GB/second.

The processing units (DVD Descrambler, Length Decoder, Video Scaler, and Graphics Engine) perform computations related to the display of encrypted video as found on a commercial DVD.

Note that there is a core dedicated to debugging. It follows the JTAG (Joint Test Action Group) protocols, defined in IEEE Standard 1149.1, the industry standard.

A High-End Graphics Coprocessor

Here are some data on the NVIDIA GeForce 9 Series (9800 GX2 and 9800 GTX). The table is taken from the web site (Ref. 6).

Card      Core Clock  Shader Clock  Memory Clock  Memory        Memory     Memory Bandwidth  Texture Fill Rate
          (MHz)       (MHz)         (MHz)         Amount        Interface  (GB/sec)          (billion/sec)
9800 GX2  600         1500          1000          1 GB          512-bit    128               76.8
9800 GTX  675         1688          1100          512 MB GDDR3  256-bit    70.4              43.2
9600 GT   650         1625          900           512 MB        256-bit    57.6              20.8

The 9800 GX2 is a multi-core design with 256 stream processors. It has a 512-bit (64-byte) memory interface operating at a peak rate of 128 gigabytes per second. This produces video at resolutions up to 2560 by 1600 pixels.

The cost of the 9800 GX2 is $520 (Ref. 6, 4/16/2008).

A High-End Audio Processor

Here are some data on the Sound Blaster XtremeGamer Fatal1ty Pro Series. It is an attached audio coprocessor for use with a PC.

    24-bit Analog to Digital conversion    96 kHz sample rate
    24-bit Digital to Analog conversion    96 kHz rate to either 7.1 audio or standard stereo
    64 MB random access memory, called "XRAM"
    Signal-to-Noise Ratio                  109 dB for stereo output
    Total Harmonic Distortion              0.004%
    Frequency Response                     10 Hz to 46 kHz (-3 dB points)

Note: These audio specifications would be considered extremely good for a high-priced audio system for home use.

The cost of this coprocessor is $150.00 (Ref. 7, 4/16/2008).

Cryptographic Coprocessors

Suppose that two workstations are to communicate over the public Internet in a secure mode. The provision of industrial-grade cryptography is very compute intensive.

Again, cryptography does not lend itself to solution by a general-purpose processor. For this reason, and also to offload the computational burden from the primary CPU, many secure communication systems use attached cryptographic processors.

Here are some data on a cryptographic processor marketed by IBM (Ref. 8). The product described is the IBM PCI Cryptographic Coprocessor.

The coprocessor provides DES, triple-DES, RSA, and DSA cryptographic operations, all national standards.

The hardware is certified under FIPS PUB 140-1 (Security Requirements for Cryptographic Modules), at level 3. The mainframe version is certified to level 4.

The coprocessor has a "tamper-sensing and tamper-responding environment" to limit and report unauthorized access to the processor itself.

The price of this unit was not quoted.

Game Engines as Supercomputers

It may surprise students to learn that many of these high-end graphics processors are actually export controlled as munitions. In this case, the control is due to the possibility of using these processors as high-performance computers.

In the next slide, we present a high-end graphics coprocessor that can be viewed as a vector processor. It is capable of a sustained rate of 430,000 Megaflops (430 Gigaflops).

Compare this to the CRAY-1 supercomputer of 1976, with a sustained computing rate of 136 Megaflops and a peak rate of 250 Megaflops. This is about 0.03% of the performance of the current graphics coprocessor at about 500 times the cost.

The Cray Y-MP was a supercomputer sold by Cray Research beginning in 1988. Its peak performance was 2.66 Gigaflops (8 processors at 333 Megaflops each). Its memory comprised 128, 256, or 512 MB of static RAM.

The earliest supercomputer that could outperform the current graphics processor seems to have been the Cray T3E-1200E, an MPP (Massively Parallel Processor) introduced in 1995 (Ref. 9). In 1998, a joint scientific team from Oak Ridge National Lab, the University of Bristol (UK), and others ran a simulation related to controlled fusion at a sustained rate of 1.02 Teraflops (1020 Gigaflops).

The next slide shows this current graphics coprocessor.

The NVIDIA Tesla C870

Data here are from the NVIDIA web site (Ref. 6). I quote from their advertising copy.

The C870 processor is a "massively multi-threaded processor architecture that is ideal for high performance computing (HPC) applications".

This has 128 processor cores, each operating at 1.35 GHz. It supports the IEEE-754 single-precision standard, and operates at a sustained rate of 430 gigaflops (512 GFlops peak).

The typical power usage is 120 watts. Note the dedicated fan for cooling.

The price is $1300, with an introductory offer at $650.

The processor has 1.5 gigabytes of DDR SDRAM, operating at 800 MHz. The data bus to memory is 384 bits (48 bytes) wide, so that the maximum sustained data rate is 48 × 2 × 800 × 10^6 bytes per second = 76.8 Gigabytes per second.

References

In this lecture, material from one or more of the following references has been used.

1. Computer Organization and Design, David A. Patterson & John L. Hennessy, Morgan Kaufmann (3rd Edition, Revised Printing), 2007 (the course textbook). ISBN 978-0-12-370606-5.
2. Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 1990. There is a later edition. ISBN 1-55860-069-8.
3. High-Performance Computer Architecture, Harold S. Stone, Addison-Wesley (Third Edition), 1993. ISBN 0-201-52688-3.
4. Structured Computer Organization, Andrew S. Tanenbaum, Pearson/Prentice-Hall (Fifth Edition), 2006. ISBN 0-13-148521-0.
5. http://en.wikipedia.org/wiki/TriMedia
6. http://www.nvidia.com
7. http://www.soundblaster.com
8. http://www-03.ibm.com/security/cryptocards/pcicc.shtml
9. http://www.cray.com