Coprocessors and Attached Processors

This lecture is based mostly on material from Tanenbaum's textbook Structured Computer Organization (Ref. 4). We shall begin with a refresher on VLIW (Very Long Instruction Word) designs and then examine a number of coprocessors, several of which are VLIW.

Topics:
1. The VLIW design and its use in single processors.
2. The TriMedia VLIW CPU.
3. Heterogeneous multiprocessors on a chip: the DVD player.
4. The Global Internet, Ethernet, and Attached Network Cards.
5. The Nexperia Media Coprocessor.
6. Other high-end video graphics cards.
7. High-end coprocessors for audio production.
8. Cryptoprocessors.

As we shall see, the economics of the mass market often favor the production of highly specialized attached processors to share the computing load with the CPU.

The Very Long Instruction Word Design

The VLIW design is one that we first encountered when discussing high-performance single-processor computing systems. The design assumed a superscalar CPU, and called for machine code words with multiple instructions, one per CPU function unit.

Each machine code word might have two integer instructions, one floating-point instruction, and so forth. Modern designs issue bundles with an end-of-bundle mark.

The TriMedia VLIW Central Processing Unit

The TriMedia processor was designed by Philips, the Dutch electronics company that also designed the CD and CD-ROM (Ref. 4). It is designed for media-intensive applications, such as image processing, CD and DVD recorders or players, digital video cameras, digital television sets, etc. The TriMedia is a true VLIW processor.

Each machine language instruction commonly specifies five operations. The machine word is divided into five slots, one per operation to be issued. Each slot commands one or more function units, so that some slots are special purpose.

Here is the format of a typical TriMedia machine instruction.

The TM3260 implementation runs at 250 MHz. Since it can issue five operations per clock cycle, it has an effective maximum rating of 1250 MIPS.

The TriMedia has a byte-oriented memory. It uses memory-mapped I/O, in which each I/O device is accessed through registers mapped into the memory address space.

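The peak rating quoted above is simply the clock rate multiplied by the number of operations issued per cycle. A minimal sketch of that arithmetic follows; the function name peak_mips is illustrative only and does not come from the source. The result is an upper bound, since it assumes that all five slots carry a useful operation on every cycle.

```python
def peak_mips(clock_mhz, ops_per_cycle):
    """Upper bound on millions of operations issued per second.

    Hypothetical helper for illustration: it assumes every issue slot
    is filled with a useful operation on every clock cycle.
    """
    return clock_mhz * ops_per_cycle


# 250 MHz clock, five operations issued per cycle.
print(peak_mips(250, 5))  # 1250
```

Real code rarely fills every slot, so sustained throughput would be expected to fall below this figure.
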
The TriMedia Processors

Here is a table taken from the Wikipedia article on the history of TriMedia processors. The year is that of first silicon introduction; the frequency is the worst-case figure.

Core      Year  ISA         Cache (I/D) KB        Frequency  Technology  Features
TM1000    1997  TMA0        32/16                 100 MHz    500 nm
TM1100    1998  TMA1        32/16                 133 MHz    350 nm
TM1300    1999  TMA1        32/16                 166 MHz    250 nm
TM3260    2002  TMA2        64/16                 250 MHz    130 nm      binary compatible with TM1300
TM5250    2004  TMA3        64/16                 450 MHz    130 nm      128 KB L2 data cache, allocate on write miss, hardware prefetching, super-pipelined (high speed)
TM2270    2006  TMA3        32/16                 290 MHz    90 nm       96 GPRs (small area)
TM3270/1  2006  TMA4 + ASE  64/128, 64/32, 32/16  350 MHz    90 nm       low power

The Tanenbaum textbook is based on the TM3260. Note the successor processors.
1. The TM5250, operating at 450 MHz. It is more powerful.
2. The TM2270 and TM3270, designed to be small and/or low in power consumption.

The two common market pressures are high performance and low power usage.

The TriMedia CPU: Details

The CPU has 128 general-purpose registers, each holding a 32-bit number. Two of the registers store constant values: R0 stores 0 and R1 stores 1. All others are general purpose and can store integers (8, 16, or 32 bits) or IEEE-754 floating point values.

The TM3260 has 12 functional units, a control unit and eleven for doing arithmetical, logical, and control flow operations. Some of these units respond only to instructions in specific instruction slots; others can be commanded from any instruction slot.

The latency is the number of steps to move a result through the functional unit. The last five columns show the placement of commands for each functional unit.

The TriMedia CPU: Mathematical Units

The standard arithmetic units use the two's-complement standard for integer arithmetic, but the DSP (Digital Signal Processor) units use saturation arithmetic.

In saturation arithmetic, an operation that produces a result not representable due to overflow saturates at the maximum value rather than generating an exception.

For example, the range of numbers representable by 8-bit unsigned integer arithmetic is 0 through 255 inclusive. In saturation arithmetic, 180 + 180 = 255, the maximum value.

With two minor exceptions, all operations in the TriMedia are predicated.

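The 8-bit saturation example given above can be sketched in a few lines. This is only an illustration of the arithmetic, not a model of the TriMedia DSP units; the function names sat_add_u8 and wrap_add_u8 are assumptions made for this example.

```python
U8_MAX = 255  # largest value representable in 8 unsigned bits


def sat_add_u8(a, b):
    """Unsigned 8-bit addition that clamps at 255 instead of wrapping."""
    return min(a + b, U8_MAX)


def wrap_add_u8(a, b):
    """Ordinary modular 8-bit addition, shown only for contrast."""
    return (a + b) & 0xFF


print(sat_add_u8(180, 180))   # 255, the saturated result
print(wrap_add_u8(180, 180))  # 104, since 360 - 256 = 104
```

Saturation is usually preferred for media samples because a clamped value (a fully bright pixel, a clipped audio peak) is far less objectionable than the large jump produced by wrap-around.
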
In a predicated instruction, each operation specifies a register that is to be tested before the operation is executed. The low-order bit of the register is examined.
0. If that bit is 0, the operation is skipped.
1. If that bit is 1, the operation is executed.

    IF R2 IADD R4, R5 -> R8    // Add R4 to R5 and place result into R8.
                               // But only if bit 0 of R2 is a 1; otherwise do nothing.

Using R1 as the predicate register makes the operation unconditional, as R1 = 1.
Using R0 as the predicate register makes it a no-op, as R0 = 0.

Heterogeneous Processor Example: The DVD Player

The computer controlling the DVD player has a number of very different functions. Each of these is assigned to a specialized processor.

This design uses multiple cores on a single large chip. A core is a large circuit, such as a CPU, I/O controller, or cache, that can be placed on a chip in a modular way. Some modern processors are dual core in that they have two cores, each being a full CPU.

This design might be called heterogeneous multi-core. Each of the closely coupled cores has a dedicated function related to the format of the data it must process. This design was found to be more economical than a single general-purpose CPU.

Computers From Piece Parts

We now face the issue of how to design computers and their major components.

Main components, such as the CPU, will continue to be designed from basic gates in the traditional way for some time. Here the advantage in performance gained from a single integrated design justifies the cost and effort involved.

We now have another attractive option for the design of computing machines. This one is made attractive by the availability of a variety of cores, each with a dedicated function. This collection of cores can be considered essentially as a set of libraries of functions, except that these functions are implemented in hardware.

IBM has produced a design, called CoreConnect, which is an architecture for connecting cores on a single-chip heterogeneous multiprocessor. Here is an example.

Note the two busses; one is faster than the other.

The Global Internet and the Network Interface Card (NIC)

You may think that your computer is connected to the Internet, but it is not. The computer is connected to a NIC; it is that NIC that is connected to the Internet.
The NIC is a dedicated I/O coprocessor, which communicates with the computer's CPU via interrupts and DMA (Direct Memory Access). Except when the NIC is operated in promiscuous mode (for network snooping), it filters all packets by MAC address.

The standard of transmission that we shall discuss is called Ethernet. Packets in this protocol possess two 48-bit MAC (Media Access Control) addresses, one for the source NIC and one for the destination NIC.

Here is the format of an Ethernet packet containing an IP packet.

The Ethernet header contains the two MAC addresses. Each NIC has a unique MAC address assigned to it under a protocol administered by the IEEE. In normal use, the NIC will recognize messages sent to its MAC address and pass only those to the CPU.

The NIC (Network Processor)

The NIC is a programmable device that can handle incoming and outgoing packets at the full network speed. It is plugged into a standard slot in the computer motherboard.

One or more network lines connect to the board and are routed to the network processor. Most setups have only a single network line attached, but computers used as switches, routers, and the like must have at least two network lines attached.

Here is a diagram of a typical network processor, using a PCI slot on the motherboard.

Note the multiple PPE (Packet Processing Engines). Each is a specialized core with a dedicated task; the set forms a packet-processing pipeline.

The Nexperia Media Processor

Ordinary general-purpose processors are not especially good at the massively parallel computations required to process high-resolution audio and video streams.

The Nexperia is a single-chip heterogeneous multiprocessor designed by Philips, using its TriMedia chip. It comprises a heterogeneous collection of cores, each with a dedicated function for which it has been optimized. Here is the PNX1500.

More on the Nexperia

The Nexperia is designed for use either as a coprocessor in a PC or as a stand-alone main processor in an appliance such as a DVD player, digital TV set, video camera, etc.

Other than the SRAM and SDRAM internal to the TriMedia processor, the Nexperia contains no main memory on the chip. The PNX1500 implementation has an interface to external memory, allowing for 8 to 256 MB of DDR SDRAM.
The width of the memory interface is 32 bits (4 bytes). This allows the DDR memory to transfer 8 bytes per clock pulse; at 200 MHz the data rate is 1.6 GB/second.

The processing units (DVD Descrambler, Length Decoder, Video Scaler, and Graphics Engine) perform computations related to the display of encrypted video as found on a commercial DVD.

Note that there is a core dedicated to debugging. It follows the JTAG (Joint Test Action Group) protocols, defined in IEEE Standard 1149.1, the industry standard.

A High-End Graphics Coprocessor

Here are some data on the NVIDIA GeForce 9 Series (9800 GX2 and 9800 GTX). The table is taken from the web site (Ref. 6).

Card      Core Clock  Shader Clock  Memory Clock  Memory        Memory     Memory Bandwidth  Texture Fill Rate
          (MHz)       (MHz)         (MHz)         Amount        Interface  (GB/sec)          (billion/sec)
9800 GX2  600         1500          1000          1 GB          512-bit    128               76.8
9800 GTX  675         1688          1100          512 MB GDDR3  256-bit    70.4              43.2
9600 GT   650         1625          900           512 MB        256-bit    57.6              20.8

The 9800 GX2 is a multi-core design with 256 stream processors. It has a 512-bit (64-byte) memory interface operating at a peak rate of 128 gigabytes per second. This produces video at resolutions up to 2560 by 1600 pixels.

The cost of the 9800 GX2 is $520 (Ref. 6, 4/16/2008).

A High-End Audio Processor

Here are some data on the Sound Blaster XtremeGamer Fatal1ty Pro Series. It is an attached audio coprocessor for use with a PC.

24-bit analog-to-digital conversion at a 96 kHz sample rate.
24-bit digital-to-analog conversion at a 96 kHz rate to either 7.1 audio or standard stereo.
64 MB random access memory, called XRAM.
Signal-to-noise ratio: 109 dB for stereo output.
Total harmonic distortion: 0.004%.
Frequency response: 10 Hz to 46 kHz (-3 dB points).

Note: These audio specifications would be considered extremely good for a high-priced audio system for home use.

The cost of this coprocessor is $150.00 (Ref. 7, 4/16/2008).

Cryptographic Coprocessors

Suppose that two workstations are to communicate over the public Internet in a secure mode. The provision of industrial-grade cryptography is very compute intensive.
Again, cryptography does not lend itself to solution by a general-purpose processor. For this reason, and also to offload the computational burden from the primary CPU, many secure communication systems use attached cryptographic processors.

Here are some data on a cryptographic processor marketed by IBM (Ref. 8). The product described is the IBM PCI Cryptographic Coprocessor.

The coprocessor provides DES, triple-DES, RSA, and DSA encryption, all national standards.
The hardware is certified under FIPS PUB 140-1 (Security Requirements for Cryptographic Modules), at level 3. The mainframe version is certified to level 4.
The coprocessor has a tamper-sensing and tamper-responding environment to limit and report unauthorized access to the processor itself.

The price of this unit was not quoted.

Game Engines as Supercomputers

It may surprise students to learn that many of these high-end graphics processors are actually export controlled as munitions. In this case, the control is due to the possibility of using these processors as high-performance computers.

In the next slide, we present a high-end graphics coprocessor that can be viewed as a vector processor. It is capable of a sustained rate of 4,300 Megaflops.

Compare this to the CRAY-1 supercomputer of 1976, with a sustained computing rate of 136 Megaflops and a peak rate of 250 Megaflops. This is about 3.2% of the performance of the current graphics coprocessor at about 500 times the cost.

The Cray Y-MP was a supercomputer sold by Cray Research beginning in 1988. Its peak performance was 2.66 Gigaflops (8 processors at 333 Megaflops each). Its memory comprised 128, 256, or 512 MB of static RAM.

The earliest supercomputer that could outperform the current graphics processor seems to have been the Cray T3E-1200E, an MPP (Massively Parallel Processor) introduced in 1995 (Ref. 9). In 1998, a joint scientific team from Oak Ridge National Lab, the University of Bristol (UK), and others ran a simulation related to controlled fusion at a sustained rate of 1.02 Teraflops (1020 Gigaflops).

The next slide shows this current graphics coprocessor.

The NVIDIA Tesla C870

Data here are from the NVIDIA web site (Ref. 6). I quote from their advertising copy.
The C870 processor is a massively multi-threaded processor architecture that is ideal for high performance computing (HPC) applications.

This has 128 processor cores, each operating at 1.35 GHz. It supports the IEEE-754 single-precision standard, and operates at a sustained rate of 430 gigaflops (512 GFlops peak).

The typical power usage is 120 watts. Note the dedicated fan for cooling.

The price is $1300, with an introductory offer at $650.

The processor has 1.5 gigabytes of DDR SDRAM, operating at 800 MHz. The data bus to memory is 384 bits (48 bytes) wide, so that the maximum sustained data rate is 48 × 2 × 800 × 10^6 bytes per second = 76.8 gigabytes per second.

References

In this lecture, material from one or more of the following references has been used.

1. Computer Organization and Design, David A. Patterson & John L. Hennessy, Morgan Kaufmann (3rd Edition, Revised Printing), 2007 (the course textbook). ISBN 978-0-12-370606-5.
2. Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 1990. There is a later edition. ISBN 1-55860-069-8.
3. High-Performance Computer Architecture, Harold S. Stone, Addison-Wesley (Third Edition), 1993. ISBN 0-201-52688-3.
4. Structured Computer Organization, Andrew S. Tanenbaum, Pearson/Prentice-Hall (Fifth Edition), 2006. ISBN 0-13-148521-0.
5. http://en.wikipedia.org/wiki/TriMedia
6. http://www.nvidia.com
7. http://www.soundblaster.com
8. http://www-03.ibm.com/security/cryptocards/pcicc.shtml
9. http://www.cray.com