Reorder Buffer: register renaming and in-order completion


sherrill-nordquist
Uploaded On 2016-07-22



Presentation Transcript

Reorder Buffer: register renaming and in-order completion
• Use of a reorder buffer
  – At issue (renaming time), an instruction is assigned an entry at the tail of the reorder buffer (ROB), which becomes the name of (or a pointer to) the result register.
• At the end of functional-unit computation, the value is put in the instruction's reorder buffer position
• When the instruction reaches the head of the buffer, its value is stored in the logical or physical (other reorder buffer entry) register.
• Need of a mapping table between logical registers and ROB entries

Example
Before:             After:
add r3,r3,4         add rob6,r3,4
add r4,r7,r3        add rob7,r7,rob6
add r3,r2,r7        add rob8,r2,r7
Assume the reorder buffer is initially at position 6 and has more than 8 slots.
The mapping table indicates the correspondence between ROB entries and logical registers.

Data dependencies with register renaming
• Register renaming does not get rid of RAW dependencies
  – Still need for forwarding or for indicating whether a register has received its value
• Register renaming gets rid of WAW and WAR dependencies
• The reorder buffer, as its name implies, can be used for in-order completion

More on reorder buffer
• Tomasulo’s scheme can be extended with the possibility of completing instructions in order
• Reorder buffer entry contains (this is not the only possible solution)
  – Type of instruction (branch, store, ALU, or load)
  – Destination (none, memory address, register including other ROB entry)
  – Value and its presence/absence
• Reservation station tags and “true” register tags are now ids of entries in the reorder buffer

Example machine revisited (Fig 2.14 (3.29))
[Figure not recovered]

Need for 4 stages
• In Tomasulo’s solution 3 stages: issue, execute, write
• Now 4 stages: issue, execute, write, commit
• Dispatch and Issue
  – Check for structural hazards (reservation stations busy, reorder buffer full). If one exists, stall the instruction and those following
  – If dispatch is possible, send source operand values to the reservation station if the values are available in either the registers or the reorder buffer. Otherwise send the tag.
  – Allocate an entry in the reorder buffer (rename result register) and send its number to the reservation station (to be used as a tag on the CDB)
  – When both operands are ready, issue to functional unit

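The renaming done at issue (take the entry at the tail of the ROB and use it as the name of the result register) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the slides' implementation: the class name `RobRenamer`, the method `issue`, and the default capacity of 16 entries are hypothetical, and the sketch models neither execution, nor commit, nor the stall required when the ROB is full. The tail starts at position 6, as in the example above.

```python
class RobRenamer:
    """Toy model of register renaming through ROB entries (illustrative only)."""

    def __init__(self, tail=6, size=16):
        self.tail = tail   # index of the next free ROB entry
        self.size = size   # assumed ROB capacity (hypothetical value)
        self.map = {}      # logical register -> ROB entry that will hold its latest value

    def issue(self, op, dest, src1, src2):
        # Sources are looked up BEFORE the destination is remapped,
        # so "add r3,r3,4" still reads the old r3.
        s1 = self.map.get(src1, src1)   # unmapped names and immediates pass through
        s2 = self.map.get(src2, src2)
        entry = f"rob{self.tail}"
        self.tail = (self.tail + 1) % self.size
        self.map[dest] = entry          # later readers of dest now use the ROB entry
        return (op, entry, s1, s2)


renamer = RobRenamer()
renamed = [
    renamer.issue("add", "r3", "r3", "4"),
    renamer.issue("add", "r4", "r7", "r3"),
    renamer.issue("add", "r3", "r2", "r7"),
]
```

With these three instructions the sketch yields `add rob6,r3,4`, `add rob7,r7,rob6` and `add rob8,r2,r7`, matching the example. The second write to r3 gets a fresh entry (rob8), which is how the WAW and WAR dependencies disappear while the RAW dependence on rob6 remains.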
Need for 4 stages (c’ed)
• Execute
• Write
  – Broadcast on the common data bus the value and the tag (reorder buffer number). Reservation stations, if any match the tag, and reorder buffer (always) grab the value.
• Commit
  – When the instr. at the head of the reorder buffer has its result in the buffer, it stores it in the real register (for ALU) or memory (for store). The reorder buffer entry (and/or physical register) is freed.

[Several slides showing successive snapshots of the example machine: reorder buffer entries (e.g. Sub F8, F6, F2), reservation stations (fields Vj, Qj, Qk) and the register map (F2, F4, F6, F8, F10, F12, ...). Table contents not recovered.]

Register renaming – Physical Register file
• Use a physical register file (as an alternative to reservation station or reorder buffer) larger than the ISA logical one
• When instruction is decoded
  – Give a new name to the result register from the free list. The register is … The mapping table is updated
  – Give source operands their physical names (from mapping table)

Register renaming – File of physical registers
• Extra set of registers organized as a free list
• At decode:
  – When a physical register has been read for the last time, return it to the free list
  – (… when instruction uses physical register as operand; release when counter is 0)
  – Simpler to wait till logical register has been assigned a new name

Example
Before:             After:
add r3,r3,4         add r37,r3,4
add r4,r7,r3        add r38,r7,r37
add r3,r2,r7        add r39,r2,r7
Free list: r37, r38, r39, …
r2, r3, r4, r7 not renamed yet.
At this point r3 is remapped from r37 to r39.
When r39 commits, r37 will be returned to the free list.

Conceptual execution on a processor which exploits ILP
• Instruction fetch and branch prediction
• Instruction decode, dependence check, dispatch, issue
• Instruction execution
• Instruction commit (for OOO only)

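The physical-register renaming example (free list r37, r38, r39) can be sketched in Python along the same lines. The names `PhysRenamer`, `rename` and `commit` are hypothetical. The sketch assumes the "simpler" release policy mentioned on the slide: a physical register is returned to the free list when the instruction that gave its logical register a new name commits. It also assumes that registers "not renamed yet" initially map to themselves.

```python
from collections import deque


class PhysRenamer:
    """Toy model of renaming with a physical register file and a free list."""

    def __init__(self, logical=("r2", "r3", "r4", "r7"), free=("r37", "r38", "r39")):
        # Assumption: each logical register starts out mapped to itself.
        self.map = {r: r for r in logical}
        self.free = deque(free)

    def rename(self, op, dest, src1, src2):
        # Sources get their current physical names from the mapping table.
        s1 = self.map.get(src1, src1)
        s2 = self.map.get(src2, src2)
        if not self.free:
            raise RuntimeError("free list empty: decode would have to stall")
        new = self.free.popleft()
        old = self.map.get(dest)   # previous name, to be released at this instruction's commit
        self.map[dest] = new
        return (op, new, s1, s2), old

    def commit(self, old):
        # Release the register that the committing instruction superseded.
        if old is not None:
            self.free.append(old)


p = PhysRenamer()
i1, old1 = p.rename("add", "r3", "r3", "4")
i2, old2 = p.rename("add", "r4", "r7", "r3")
i3, old3 = p.rename("add", "r3", "r2", "r7")
```

After the third instruction r3 maps to r39 and `old3` is r37, so calling `p.commit(old3)` when that instruction commits puts r37 back on the free list, as stated in the example.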
Multiple Issue Alternatives
• Superscalar (hardware detects conflicts)
  – Statically scheduled (in order dispatch and hence execution; cf. (DEC) Alpha 21164, Sun processor in Niagara, IBM Cell Synergistic Processor)
  – Dynamically scheduled (in order issue, out of order dispatch and execution; cf. MIPS R10000, IBM Power4 and 5, Intel Pentium P6 microarchitecture, AMD K5 et al, (DEC) Alpha 21264, Sun UltraSparc etc.)
• VLIW – EPIC (Explicitly Parallel Instruction Computing)
  – Compiler generates “bundles” of instructions that can be executed concurrently (cf. Intel Itanium, lot of DSP’s)

Multiple Issue for Static/Dynamic Scheduling
• Issue in order
  – Check for structural hazards; if any, stall
• Dispatch for static scheduling
  – Can take forwarding into account
• Dispatch for dynamic scheduling
  – Dispatch out of order (reservation stations, instruction window)
  – Requires possibility of dispatching concurrently dependent instructions

Impact of Multiple Issue on IF
• IF: Need to fetch more than 1 instruction at a time
  – Simpler if instructions are of fixed length
  – In fact need to fetch as many instructions as the issue stage can handle in one cycle
  – Simpler if restricted not to overlap I-cache lines
  – But with branch prediction, this is not realistic, hence introduction of (instruction) fetch buffers and trace caches
  – Always attempt to keep at least as many instructions in the fetch buffer as can be issued in the next cycle (BTB’s help for that)
  – For example, have an 8 wide instruction buffer for a machine that can issue 4 instructions per cycle

Stalls at the IF Stage
• Instruction cache miss
• Instruction buffer is full
  – Most likely there are stalls in the stages downstream
• Branch misprediction
• Instructions are stored in several I-cache lines
  – In one cycle one I-cache line can be brought into the fetch buffer
  – A basic block might start in the middle (or end) of an I-cache line
  – Requires several cache lines to fill the buffer
  – The ID (issue-dispatch) stage will stall if not enough instructions are in the fetch buffer

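The effect of a basic block starting in the middle of an I-cache line can be made concrete with a little arithmetic. The Python sketch below assumes fixed-length instructions and a fetch that cannot cross an I-cache line in one cycle. The function name and the default parameters (32-byte lines, 4-byte instructions, 4-wide fetch) are illustrative assumptions, not values taken from the slides.

```python
def instructions_fetched(pc, line_bytes=32, instr_bytes=4, fetch_width=4):
    """Number of instructions one fetch cycle can deliver starting at address pc.

    Assumes pc is aligned to instr_bytes and that a fetch stops at the end
    of the current I-cache line.
    """
    remaining_in_line = (line_bytes - pc % line_bytes) // instr_bytes
    return min(fetch_width, remaining_in_line)
```

Under these assumptions a fetch starting at the beginning of a line delivers the full 4 instructions, but a branch target 8 bytes before the end of a line delivers only 2. That shortfall is what a fetch buffer wider than the issue width is meant to absorb.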
Sample of Old and Current Micros
• Two instruction issue: Alpha 21064, Sparc 2, Pentium, Cyrix
• Three instruction issue: Pentium P6 (but 5 uops from IF/ID to EX; Pentium 4 and AMD K7 have 4 uops, Intel Core has 6 uops)
• Four instruction issue: Alpha 21164, Alpha 21264, IBM Power4 and Power5 (but somewhat restricted), Sun UltraSparc, HP PA-8000, MIPS R10000
• Many papers written in mid-90’s predicted 16-way issue by 2000. We are still at 4 in 2007!

The Decode Stage (simple case: dual issue and static scheduling)
• ID = Dispatch + Issue
• Look for conflicts between the (say) 2 instructions
  – If one integer op. and one f-p op., only check for structural hazard, i.e. the two instructions need the same f-u (easy to check with opcodes)
  – RAW dependencies resolved as in single pipelines
  – Note that the load delay (assume 1 cycle) can now delay up to 3 instructions, i.e., 3 issue slots are lost

Decode in Simple Multiple Issue Case
• If instructions i and i+1 are fetched together and:
  – Instruction i stalls, instruction i+1 will stall
  – Instruction i is dispatched but instruction i+1 stalls (e.g., because of structural hazard = need the same f-u), instruction i+2 will not advance to the issue stage. It will have to wait till both i and i+1 have been dispatched

Alpha 21164 (@1995)
• 4-wide … pipeline
[Figure not recovered]

Alpha 21164 – Front-end
• IF-S0: Access I-cache
• IF-S1: Branch Prediction
  – … cache + static prediction BTFNT
• ID-S2: Slotting
• ID-S3: …

Alpha 21164 Restrictions in front-end
• In integer programs, only 2 arithmetic instructions can pass from S2 to S3 (structural hazards)
  – This percolates back…
• In S0, only instructions in the same cache line can be fetched in a given cycle
  – Too bad if you branch in the middle of a cache line…
• Target branch address computed in S1
  – So if predict taken, you have one “bubble”. Good chance it will be amortized by other effects downstream
• S3 uses the equivalent of a (simplified) scoreboard

Alpha 21164 - Back-end
• Load latency: 2 cycles
  – Scoreboard does not know if cache hit or miss
  – …
• On branch mispredict (and precise) exceptions
  – Known at S5. All inst. in program order after the branch are aborted
• Other possible structural hazards due to store buffers etc. (see later)
• What happens on a D-TLB miss?

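The decode rules for the simple dual-issue, statically scheduled case described earlier (instructions i and i+1 fetched together) can be summarized in a short Python sketch. It is hedged throughout. The names `Instr` and `dispatch_count` are hypothetical. The sketch assumes a single functional unit of each kind, and it assumes that a RAW dependence of i+1 on i inside the pair holds i+1 back for this cycle. It only reports how many instructions of the pair can be dispatched.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    unit: str                 # functional unit needed, e.g. "int" or "fp"
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers


def dispatch_count(i: Instr, j: Instr, i_stalled: bool = False) -> int:
    """How many of the pair (i, i+1) can be dispatched this cycle: 0, 1 or 2."""
    if i_stalled:
        return 0   # if instruction i stalls, instruction i+1 stalls too
    if i.unit == j.unit:
        return 1   # structural hazard: both need the same f-u
    if i.dest is not None and i.dest in j.srcs:
        return 1   # assumption: RAW dependence inside the pair delays i+1
    return 2
```

Whenever the result is 1, the next pair cannot advance to the issue stage until the held-back instruction has been dispatched, which is why a single conflict costs more than one issue slot.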
Dynamic Scheduling: Reservation stations, register renaming and reorder buffer
• Decode means:
  – Dispatch to either
    • A centralized instruction window common to all functional units (Pentium Pro, Pentium III and Pentium 4)
    • Reservation stations associated with functional units (MIPS 10000, AMD K5-7, IBM Power4 and Power5)
  – Rename registers (either via ROB or physical file)
    • Note the difficulty when renaming in the same cycle R1 ← R2 + R3; R4 ← R1 + R5
  – Set up entry at tail of reorder buffer (if supported by architecture)
  – Issue operands, when ready, to functional unit

Stalls in Decode (issue/dispatch) Stage
• If there are decentralized reservation stations, there can be several instructions ready to be dispatched in the same cycle to the same functional unit
  – Possibility of not enough reservation stations
• If there is a centralized instruction window, there might not be enough bus/ports to forward values to the execution units that need them in the same cycle
• Both instances are instances of structural hazards
  – Conflicts are resolved via algorithm
  – Try and define … instructions

The Execute Stage
• Use of forwarding
  – Use of broadcast bus or cross-bar or other interconnection network
• We’ll talk at length about memory operations (load-store) in subsequent lecture and when we study memory hierarchies

The Commit Step (in-order completion)
• Recall: need of a mechanism (reorder buffer) to:
  – “Complete” instructions in order. This commits the instruction. Since multiple issue machine, should be able to commit (retire) several instructions per cycle
  – Know when an instruction has completed non-speculatively, i.e., what to do with branches
  – Know whether the result of an instruction is correct, i.e., what to do with exceptions

Impact on Branch Prediction and Completion
• When a conditional branch is decoded:
  – Save the current physical-logical mapping
  – Predict and proceed
• When branch is ready to commit (head of buffer)
  – If prediction correct, discard the saved mapping
  – If prediction incorrect …
• Note that there have been proposals to execute both sides of a branch using register shadows
  – limited to one extra set of registers

Exceptions
• Instructions carry their exception status
• When instruction is ready to commit
  – No exception: proceed normally
  – Exception: …

Summary: OOO flow of instructions
[Figure not recovered]

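The handling of the register mapping around a predicted branch (save at decode, discard when the prediction was correct) can be sketched in Python as follows. The transcript does not say what happens on a misprediction, so restoring the saved mapping and dropping the checkpoints of younger branches is an assumption of this sketch, not a statement from the slides. The names `MapCheckpoints`, `decode_branch` and `resolve` are hypothetical.

```python
class MapCheckpoints:
    """Toy model of saving and restoring the logical-to-physical register mapping."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)   # current logical -> physical mapping
        self.saved = {}                # branch id -> mapping snapshot taken at decode

    def decode_branch(self, branch_id):
        # Save the current mapping, then predict and proceed.
        self.saved[branch_id] = dict(self.mapping)

    def resolve(self, branch_id, prediction_correct):
        # Called when the branch reaches the head of the buffer.
        snapshot = self.saved.pop(branch_id)
        if not prediction_correct:
            # Assumption: restore the saved mapping. Every remaining checkpoint
            # belongs to a younger, now squashed branch, so drop them all.
            self.mapping = snapshot
            self.saved.clear()


m = MapCheckpoints({"r3": "r37"})
m.decode_branch("b1")
m.mapping["r3"] = "r39"               # speculative renaming after the branch
m.resolve("b1", prediction_correct=False)
```

In this run the speculative remapping of r3 to r39 is undone and r3 maps to r37 again. Had the prediction been correct, the snapshot would simply have been discarded and the mapping left untouched.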
Pentium Family (slightly more details in H&P Sec 2.10 (3.10 in 3))
• Fetch-Decode unit
  – Transforms up to 3 instructions at a time into micro-operations (uops) and stores them in a global reservation table (instruction window). Does register renaming (RAT = register alias table)
• Dispatch (aka issue)-execution unit
  – Issues uops to functional units that execute them and temporarily store the results
  – Depending on the implementation from 3 to 6 …
• Retire unit
  – Commits the instructions in order (up to 3 commits/cycle)

The 3 units of the Pentium P6 are “independent” and communicate through the instruction pool
[Figure not recovered]

A Few More Details: Front-end
• Instruction Fetch (not in Pentium 4)
• Instruction Decode

Front-end (ctd)
• Register renaming
• Enter µops in reservation stations and ROB

Back-end
• µops can get executed when
  – Operands are available
  – The Execution Unit for that µop is available
  – A result bus will be available at completion
  – No more “important” µop should be executed
  – So it takes two cycles (pipe stages) to do all that. Then:
• µops are executed
  – We’ll see about load-store later
• Commit (aka retire)
  – All µops from the same instruction should be retired together (done by marking the beginning and end of instructions when put in the ROB)

Limits to Hardware-based ILP
• Inherent lack of parallelism in programs
  – Partial remedy: loop unrolling and other compiler optimizations
  – Branch prediction to allow earlier issue and dispatch
• Complexity in hardware
  – Needs large bandwidth for instruction fetch (might need to fetch from more than one I-cache line in one cycle)
  – Requires large register bandwidth (multiported register files)
  – Forwarding/broadcast requires “long wires” (long wires are slow) as soon as there are many units.

Limits to Hardware-based ILP (c’ed)
• Difficulties specific to the implementation
  – More possibilities of structural hazards (need to encode some priorities in case of conflict in resource allocations)
  – Parallel search in reservation stations, reorder buffer etc.
  – Additional state savings for branches (mappings), more complex updating of BPT’s and BTB’s.
  – Keeping precise exceptions is more complex
