Reorder Buffer: register renaming and in-order completion

Use of a reorder buffer
  At issue (renaming time), an instruction is assigned an entry at the tail of the reorder buffer (ROB), which becomes the name of (or a pointer to) the result register.
  At end of functional-unit computation, the value is put in the instruction's reorder buffer position.
  When the instruction reaches the head of the buffer, its value is stored in the logical or physical (other reorder buffer entry) register.
  Need of a mapping table between logical registers and ROB entries.

Example
  Before:              After:
  add r3,r3,4          add rob6,r3,4
  add r4,r7,r3         add rob7,r7,rob6
  add r3,r2,r7         add rob8,r2,r7
  Assume reorder buffer is initially at position 6 and has more than 8 slots.
  The mapping table indicates the correspondence between ROB entries and logical registers.

Data dependencies with register renaming
  Register renaming does not get rid of RAW dependencies
    Still need for forwarding or for indicating whether a register has received its value
  Register renaming gets rid of WAW and WAR dependencies
  The reorder buffer, as its name implies, can be used for in-order completion

More on reorder buffer
  Tomasulo's scheme can be extended with the possibility of completing instructions in order
  Reorder buffer entry contains (this is not the only possible solution)
    Type of instruction (branch, store, ALU, or load)
    Destination (none, memory address, register including other ROB entry)
    Value and its presence/absence
  Reservation station tags and true register tags are now ids of entries in the reorder buffer

Example machine revisited (Fig 2.14 (3.29))

Need for 4 stages
  In Tomasulo's solution 3 stages: issue, execute, write
  Now 4 stages: issue, execute, write, commit
  Dispatch and Issue
    Check for structural hazards (reservation stations busy, reorder buffer full). If one exists, stall the instruction and those following
    If dispatch possible, send source operand values to reservation station if the values are available in either the registers or the reorder buffer. Otherwise send tag.
    Allocate an entry in the reorder buffer (rename result register) and send its number to the reservation station (to be used as a tag on CDB)
    When both operands are ready, issue to functional unit
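Going back to the reorder-buffer renaming example (add r3,r3,4 and the two instructions after it), the renaming step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rename_with_rob, the tuple layout (op, dest, src1, src2) and the rob<N> naming are hypothetical, and ROB wrap-around is ignored because the slides assume the buffer has more than 8 slots.

```python
def rename_with_rob(program, rob_tail):
    """Rename each destination register to the ROB entry allocated at the tail.

    Hypothetical helper for illustration; instructions are (op, dest, src1, src2).
    """
    mapping = {}   # logical register -> ROB entry that will hold its latest value
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are looked up BEFORE the destination is remapped, so
        # "add r3,r3,4" still reads the old r3. Names without a mapping
        # (unrenamed registers, immediates) are passed through unchanged.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        entry = f"rob{rob_tail}"   # entry at the tail becomes the result's name
        rob_tail += 1              # assumption: no wrap-around needed
        mapping[dest] = entry
        renamed.append((op, entry, s1, s2))
    return renamed, mapping


program = [
    ("add", "r3", "r3", "4"),
    ("add", "r4", "r7", "r3"),
    ("add", "r3", "r2", "r7"),
]
renamed, mapping = rename_with_rob(program, rob_tail=6)
for op, dest, s1, s2 in renamed:
    print(f"{op} {dest},{s1},{s2}")
```

Run on the three instructions of the example, the sketch reproduces the "After" column, and the final mapping points r3 at rob8 and r4 at rob7. The second instruction still reads rob6, which is the RAW dependency that renaming does not remove.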
Need for 4 stages (c'ed)
  Execute
  Write
    Broadcast on common data bus the value and the tag (reorder buffer number). Reservation stations, if any match the tag, and reorder buffer (always) grab the value.
  Commit
    When instr. at head of the reorder buffer has its result in the buffer, it stores it in the real register (for ALU) or memory (for store). The reorder buffer entry (and/or physical register) is freed.

[Figures: successive snapshots of the example machine, showing reorder buffer entries, reservation stations (Vj, Qj, Qk) and register status (F2, F4, F6, F8, F10, F12, ...)]

Register renaming - Physical Register file
  Use a physical register file (as an alternative to reservation station or reorder buffer) larger than the ISA logical one
  When instruction is decoded
    Give a new name to result register from free list. The register is [...]
    The mapping table is updated
    Give source operands their physical names (from mapping table)

Register renaming - File of physical registers
  Extra set of registers organized as a free list
  At decode: [...]
  When a physical register has been read for the last time, return it to the free list
    [...] when instruction uses physical register as operand; release when counter is 0)
    Simpler to wait till logical register has been assigned a new name [...]

Example
  Before:              After:
  add r3,r3,4          add r37,r3,4
  add r4,r7,r3         add r38,r7,r37
  add r3,r2,r7         add r39,r2,r7
  Free list: r37, r38, r39, ...
  r2, r3, r4, r7 not renamed yet
  At this point r3 is remapped from r37 to r39. When r39 commits, r37 will be returned to the free list.

Conceptual execution on a processor which exploits ILP
  Instruction fetch and branch prediction
  Instruction decode, dependence check, dispatch, issue
  Instruction execution
  Instruction commit (for OOO only)
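The physical-register version of the renaming example (r37, r38, r39 taken from the free list) can be sketched the same way. The sketch below is an assumption-laden illustration, not the scheme of any particular processor: the class name Renamer and its methods are hypothetical, and it uses the simpler release policy mentioned on the slide, where the previous physical register of a logical register is returned to the free list when the instruction that remapped it commits.

```python
from collections import deque


class Renamer:
    """Toy physical-register renamer with a free list (hypothetical names)."""

    def __init__(self, free_regs):
        self.free = deque(free_regs)   # free list of physical registers
        self.mapping = {}              # logical register -> physical register
        self.on_commit = {}            # new physical reg -> previous physical reg (or None)

    def rename(self, op, dest, src1, src2):
        # Source operands get their physical names from the mapping table;
        # registers that were never renamed keep their logical name.
        s1 = self.mapping.get(src1, src1)
        s2 = self.mapping.get(src2, src2)
        if not self.free:
            # Assumption: an empty free list is a structural hazard; a real
            # pipeline would stall here instead of raising.
            raise RuntimeError("free list empty")
        new = self.free.popleft()
        self.on_commit[new] = self.mapping.get(dest)  # remember the old mapping
        self.mapping[dest] = new                      # update the mapping table
        return (op, new, s1, s2)

    def commit(self, phys):
        # When the instruction writing `phys` commits, the physical register
        # previously mapped to the same logical register can be recycled.
        old = self.on_commit.pop(phys)
        if old is not None:
            self.free.append(old)


renamer = Renamer(["r37", "r38", "r39"])
program = [
    ("add", "r3", "r3", "4"),
    ("add", "r4", "r7", "r3"),
    ("add", "r3", "r2", "r7"),
]
renamed = [renamer.rename(*ins) for ins in program]
for phys in ("r37", "r38", "r39"):   # commit in program order
    renamer.commit(phys)
```

After the three renames r3 maps to r39, and r37 goes back on the free list only once the instruction writing r39 has committed, which matches the note at the end of the example.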
Multiple Issue Alternatives
  Superscalar (hardware detects conflicts)
    Statically scheduled (in order dispatch and hence execution; cf. (DEC) Alpha 21164, Sun processor in Niagara, IBM Cell Synergistic Processor)
    Dynamically scheduled (in order issue, out of order dispatch and execution; cf. MIPS 10000, IBM Power4 and 5, Intel Pentium P6 microarchitecture, AMD K5 et al, (DEC) Alpha 21264, Sun UltraSparc etc.)
  VLIW - EPIC (Explicitly Parallel Instruction Computing)
    Compiler generates "bundles" of instructions that can be executed concurrently (cf. Intel Itanium, lot of DSPs)

Multiple Issue for Static/Dynamic Scheduling
  Issue in order
    Check for structural hazards; if any stall
  Dispatch for static scheduling
    Can take forwarding into account
  Dispatch for dynamic scheduling
    Dispatch out of order (reservation stations, instruction window)
    Requires possibility of dispatching concurrently dependent instructions

Impact of Multiple Issue on IF
  IF: Need to fetch more than 1 instruction at a time
    Simpler if instructions are of fixed length
    In fact need to fetch as many instructions as the issue stage can handle in one cycle
    Simpler if restricted not to overlap I-cache lines
      But with branch prediction, this is not realistic, hence introduction of (instruction) fetch buffers and trace caches
  Always attempt to keep at least as many instructions in the fetch buffer as can be issued in the next cycle (BTBs help for that)
    For example, have an 8 wide instruction buffer for a machine that can issue 4 instructions per cycle

Stalls at the IF Stage
  Instruction cache miss
  Instruction buffer is full
    Most likely there are stalls in the stages downstream
  Branch misprediction
  Instructions are stored in several I-cache lines
    In one cycle one I-cache line can be brought into fetch buffer
    A basic block might start in the middle (or end) of an I-cache line
    Requires several cache lines to fill the buffer
    The ID (issue-dispatch) stage will stall if not enough instructions in the fetch buffer
Sample of Old and Current Micros
  Two instruction issue: Alpha 21064, Sparc 2, Pentium, Cyrix
  Three instruction issue: Pentium P6 (but 5 uops from IF/ID to EX; Pentium 4 and AMD K7 have 4 uops, Intel Core has 6 uops)
  Four instruction issue: Alpha 21164, Alpha 21264, IBM Power4 and Power5 (but somewhat restricted), Sun UltraSparc, HP PA-8000, MIPS R10000
  Many papers written in mid-90s predicted 16-way issue by 2000. We are still at 4 in 2007!

The Decode Stage (simple case: dual issue and static scheduling)
  ID = Dispatch + Issue!
  Look for conflicts between the (say) 2 instructions
    If one integer op. and one f-p op., only check for structural hazard, i.e. the two instructions need the same f-u (easy to check with opcodes)
    RAW dependencies resolved as in single pipelines
    Note that the load delay (assume 1 cycle) can now delay up to 3 instructions, i.e., 3 issue slots are lost

Decode in Simple Multiple Issue Case
  If instructions i and i+1 are fetched together and:
    Instruction i stalls, instruction i+1 will stall
    Instruction i is dispatched but instruction i+1 stalls (e.g., because of structural hazard = need the same f-u), instruction i+2 will not advance to the issue stage. It will have to wait till both i and i+1 have been dispatched

Alpha 21164 (@1995)
  4-wide [...] pipeline

Alpha 21164 Front-end
  IF-S0: Access I-cache
  IF-S1: Branch prediction
    [...] cache + static prediction BTFNT
  ID-S2: Slotting [...]
  ID-S3: [...]

Alpha 21164 Restrictions in front-end
  In integer programs, only 2 arithmetic instructions can pass from S2 to S3 (structural hazards)
    This percolates back [...]
  In S0, only instructions in the same cache line can be fetched in a given cycle
    Too bad if you branch in the middle of a cache line
  Target branch address computed in S1
    So if predict taken, you have one bubble. Good chance it will be amortized by other effects downstream
  S3 uses the equivalent of a (simplified) scoreboard

Alpha 21164 - Back-end
  Load latency: 2 cycles
    Scoreboard does not know if cache hit or miss [...]
  On branch mispredict (and precise) exceptions
    Known at S5. All inst. in program order after the branch are aborted [...]
  Other possible structural hazards due to store buffers etc. (see later)
  What happens on a D-TLB miss?
Dynamic Scheduling: Reservation stations, register renaming and reorder buffer
  Decode means:
    Dispatch to either
      A centralized instruction window common to all functional units (Pentium Pro, Pentium III and Pentium 4)
      Reservation stations associated with functional units (MIPS 10000, AMD K5-7, IBM Power4 and Power5)
    Rename registers (either via ROB or physical file)
      Note the difficulty when renaming in the same cycle R1 <- R2 + R3; R4 <- R1 + R5
    Set up entry at tail of reorder buffer (if supported by architecture)
  Issue operands, when ready, to functional unit

Stalls in Decode (issue/dispatch) Stage
  If there are decentralized reservation stations, there can be several instructions ready to be dispatched in same cycle to same functional unit
  Possibility of not enough reservation stations
  If there is a centralized instruction window, there might not be enough bus/ports to forward values to the execution units that need them in the same cycle
  Both instances are instances of structural hazards
    Conflicts are resolved via algorithm [...]
    Try and define [...] instructions

The Execute Stage
  Use of forwarding
  Use of broadcast bus or cross-bar or other interconnection network
  We'll talk at length about memory operations (load-store) in subsequent lecture and when we study memory hierarchies

The Commit Step (in-order completion)
  Recall: need of a mechanism (reorder buffer) to:
    "Complete" instructions in order. This commits the instruction. Since multiple issue machine, should be able to commit (retire) several instructions per cycle
    Know when an instruction has completed non-speculatively, i.e., what to do with branches
    Know whether the result of an instruction is correct, i.e., what to do with exceptions

Impact on Branch Prediction and Completion
  When a conditional branch is decoded:
    Save the current physical-logical mapping
    Predict and proceed
  When branch is ready to commit (head of buffer)
    If prediction correct, discard the saved mapping
    If prediction incorrect [...]
  Note that there have been proposals to execute both sides of a branch using register shadows limited to one extra set of registers

Exceptions
  Instructions carry their exception status
  When instruction is ready to commit
    No exception: proceed normally
    Exception [...]

Summary: OOO flow of instructions
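The commit step described above (results may arrive out of order, but instructions retire only from the head of the reorder buffer, possibly several per cycle) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function commit_cycle, the dictionary fields (name, dest, value, done) and the default width of 2 are hypothetical, and branch and exception handling are deliberately left out.

```python
from collections import deque


def commit_cycle(rob, regfile, width=2):
    """Retire up to `width` completed entries from the head of the ROB, in order.

    Hypothetical helper: `rob` is a deque of dicts, oldest instruction first.
    """
    retired = []
    while rob and len(retired) < width:
        head = rob[0]
        if not head["done"]:
            # The head has no result yet, so nothing younger may commit,
            # even if its own result is already in the buffer.
            break
        rob.popleft()                           # free the ROB entry
        regfile[head["dest"]] = head["value"]   # store into the real register
        retired.append(head["name"])
    return retired


rob = deque([
    {"name": "rob6", "dest": "r3", "value": 7, "done": True},
    {"name": "rob7", "dest": "r4", "value": None, "done": False},
    {"name": "rob8", "dest": "r3", "value": 9, "done": True},   # finished early
])
regfile = {}

first = commit_cycle(rob, regfile)       # only rob6 can retire; rob7 blocks rob8
rob[0].update(value=5, done=True)        # rob7's result is written back
second = commit_cycle(rob, regfile)      # rob7 and rob8 retire together
```

In the first cycle only rob6 retires because rob7 at the head is still waiting, even though rob8 already has its value. Once rob7 completes, both remaining entries retire in the same cycle, and r3 ends up with the value of the youngest writer.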
Pentium Family (slightly more details in H&P Sec 2.10 (3.10 in 3))
  Fetch-Decode unit
    Transforms up to 3 instructions at a time into micro-operations (uops) and stores them in a global reservation table (instruction window). Does register renaming (RAT = register alias table)
  Dispatch (aka issue) - execution unit
    Issues uops to functional units that execute them and temporarily store the results
    Depending on the implementation from 3 to 6 [...]
  Retire unit
    Commits the instructions in order (up to 3 commits/cycle)

The 3 units of the Pentium P6 are "independent" and communicate through the instruction pool

A Few More Details: Front-end
  Instruction Fetch (not in Pentium 4)
  Instruction Decode

Front-end (ctd)
  Register renaming
  Enter µops in reservation stations and ROB

Back-end
  µops can get executed when
    Operands are available
    The Execution Unit for that µop is available
    A result bus will be available at completion
    No more important µop should be executed
  So it takes two cycles (pipe stages) to do all that. Then:
    µops are executed
    We'll see about load-store later
  Commit (aka retire)
    All µops from the same instruction should be retired together (done by marking beg. and end of instructions when put in the ROB)

Limits to Hardware-based ILP
  Inherent lack of parallelism in programs
    Partial remedy: loop unrolling and other compiler optimizations
    Branch prediction to allow earlier issue and dispatch
  Complexity in hardware
    Needs large bandwidth for instruction fetch (might need to fetch from more than one I-cache line in one cycle)
    Requires large register bandwidth (multiported register files)
    Forwarding/broadcast requires "long wires" (long wires are slow) as soon as there are many units.

Limits to Hardware-based ILP (c'ed)
  Difficulties specific to the implementation
    More possibilities of structural hazards (need to encode some priorities in case of conflict in resource allocations)
    Parallel search in reservation stations, reorder buffer etc.
    Additional state savings for branches (mappings), more complex updating of BPTs and BTBs.
    Keeping precise exceptions is more complex