CSE 502: Computer Architecture

Uploaded On 2023-10-04

Presentation Transcript

1. CSE 502: Computer Architecture
Out-of-Order Memory Access

2. Dynamic Scheduling Summary
- Out-of-order execution: a performance technique
- Feature I: Dynamic scheduling (iO -> OoO)
  - “Performance” piece: re-arrange insns. for high perf.
  - Decode (iO) -> dispatch (iO) + issue (OoO)
  - Two algorithms: Scoreboard, Tomasulo
- Feature II: Precise state (OoO -> iO)
  - “Correctness” piece: put insns. back into program order
  - Writeback (OoO) -> complete (OoO) + retire (iO)
  - Two designs: P6, R10K
- One remaining piece: OoO memory accesses

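3. Executing Memory Instructions
Program order:
    Load  R3 = 0[R6]
    Add   R7 = R3 + R9
    Store R4 -> 0[R7]
    Sub   R1 = R1 - R2
    Load  R8 = 0[R1]
Out-of-order timeline (from the slide annotations):
- Load R3 issues: cache miss!
- Sub R1 issues
- Load R8 issues: cache hit!
- Miss serviced…
- Add R7 issues, then Store R4 issues
- But there was a later load…
Outcome:
- If R1 != R7, then Load R8 gets the correct value from the cache
- If R1 == R7, then Load R8 should get its value from the Store, but it didn’t!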
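The hazard above can be reproduced with a small sketch. This is an illustrative model only, not a pipeline simulator: the function `run`, the step names, and the register and memory values are all assumptions, chosen so that R1 == R7 by the time the last load executes.

```python
def run(order, regs, mem):
    """Execute the slide's five instructions in the given order; return R8."""
    regs, mem = dict(regs), dict(mem)  # work on copies
    for step in order:
        if step == "load_r3":      # Load  R3 = 0[R6]
            regs["R3"] = mem[regs["R6"]]
        elif step == "add_r7":     # Add   R7 = R3 + R9
            regs["R7"] = regs["R3"] + regs["R9"]
        elif step == "store_r4":   # Store R4 -> 0[R7]
            mem[regs["R7"]] = regs["R4"]
        elif step == "sub_r1":     # Sub   R1 = R1 - R2
            regs["R1"] = regs["R1"] - regs["R2"]
        elif step == "load_r8":    # Load  R8 = 0[R1]
            regs["R8"] = mem[regs["R1"]]
    return regs["R8"]

# Hypothetical initial state: R3 + R9 and R1 - R2 both come out to 0x100.
REGS = {"R1": 0x108, "R2": 0x8, "R4": 99, "R6": 0x200, "R9": 0x10}
MEM = {0x200: 0xF0, 0x100: 7}

PROGRAM_ORDER = ["load_r3", "add_r7", "store_r4", "sub_r1", "load_r8"]
# Order from the slide: the second load runs before the store.
OOO_ORDER = ["load_r3", "sub_r1", "load_r8", "add_r7", "store_r4"]

in_order_r8 = run(PROGRAM_ORDER, REGS, MEM)  # sees the stored value
ooo_r8 = run(OOO_ORDER, REGS, MEM)           # sees the stale value
```

In program order the load observes the stored 99; in the out-of-order schedule it reads the stale 7 from memory.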

4. Memory Disambiguation Problem
- Ordering problem is a data-dependence violation
  - Imprecise memory is worse than imprecise registers
- Why can’t this happen with non-memory insns.?
  - Operand specifiers in non-memory insns. are absolute
    - “R1” refers to one specific location
  - Operand specifiers in memory insns. are ambiguous
    - “R1” refers to a memory location specified by the value of R1. When pointers (e.g., R1) change, so does this location

5. Two Problems
- Memory disambiguation on loads
  - Do earlier unexecuted stores to the same address exist?
  - Binary question: answer is yes or no
- Store-to-load forwarding problem
  - I’m a load: Which earlier store do I get my value from?
  - I’m a store: Which later load(s) do I forward my value to?
  - Non-binary question: answer is one or more insn. identifiers
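The two questions can be written as three small functions over a list of in-flight memory operations. This is a sketch under assumptions: the `MemOp` record, its `executed` flag, and the function names are hypothetical, and an older store whose address is still unknown is conservatively treated as a possible match.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MemOp:
    seq: int                 # program-order sequence number
    is_store: bool
    addr: Optional[int]      # None while the address is not yet computed
    executed: bool = True

def older_store_may_alias(load: MemOp, ops: List[MemOp]) -> bool:
    """Binary question: is there an earlier unexecuted store that may match?"""
    return any(
        op.is_store and op.seq < load.seq and not op.executed
        and (op.addr is None or op.addr == load.addr)
        for op in ops
    )

def forwarding_store(load: MemOp, ops: List[MemOp]) -> Optional[int]:
    """I'm a load: sequence number of the youngest older matching store."""
    older = [op.seq for op in ops
             if op.is_store and op.seq < load.seq and op.addr == load.addr]
    return max(older) if older else None

def forwarding_targets(store: MemOp, ops: List[MemOp]) -> List[int]:
    """I'm a store: later loads that should receive my value."""
    targets = []
    for op in sorted(ops, key=lambda o: o.seq):
        if op.seq <= store.seq or op.addr != store.addr:
            continue
        if op.is_store:
            break  # a younger store to the same address shadows this one
        targets.append(op.seq)
    return targets
```

The first function returns a yes/no answer, while the other two return instruction identifiers, which is why the forwarding problem needs an age-aware search rather than a single match bit.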

6. Load/Store Queue (1/3)
- Load/store queue (LSQ)
  - Completed stores write to LSQ
  - When store retires, head of LSQ written to L1-D
  - When loads execute, access LSQ and L1-D in parallel
  - Forward from LSQ if older store with matching address
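A minimal software sketch of this behavior follows. The class `LSQ` and its method names are assumptions made for illustration, the cache is modeled as a plain dictionary, and the parallel LSQ/L1-D lookup is modeled as "check the queue first, then fall back to the cache".

```python
class LSQ:
    """Toy load/store queue: holds completed stores until they retire."""

    def __init__(self, cache):
        self.cache = cache   # dict: address -> value, stands in for L1-D
        self.stores = []     # (seq, addr, value) tuples, oldest first

    def store_complete(self, seq, addr, value):
        """A completed store writes to the LSQ, not to the cache."""
        self.stores.append((seq, addr, value))
        self.stores.sort()   # keep program order (sorted by seq)

    def retire_store(self):
        """On retirement, the head of the queue is written to the cache."""
        seq, addr, value = self.stores.pop(0)
        self.cache[addr] = value
        return seq

    def load(self, seq, addr):
        """Forward from the youngest older matching store, else read cache."""
        for s_seq, s_addr, s_value in reversed(self.stores):
            if s_seq < seq and s_addr == addr:
                return s_value
        return self.cache[addr]
```

For example, with a cache holding 42 at 0x3290, a completed but unretired store of -17 to 0x3290 with sequence number 41775 makes `load(41777, 0x3290)` return -17, while the older `load(41773, 0x3290)` still returns 42.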

7. Load/Store Queue (2/3)
[Figure: processor block diagram. Blocks: I$, BP, regfile, ROB, LSQ, L1-D. Labeled paths: load/store, store data, addr, load data]
Almost a “real” processor diagram

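8. Load/Store Queue (3/3)
LSQ contents (oldest at top, youngest at bottom):
L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
S    0xF85C  41779  0x3290  0
L    0xF870  41780  0x3410  25
L    0xF628  41781  0x3290  0
L    0xF63C  41782  0x3300  1
Data Cache:
Addr    Value
0x3290  42
0x3410  38
0x3418  1234
0x3300  1
Slide overlay values: 25, -17 (matching the data of stores 41774 and 41775)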

9. In-order Memory (Policy 1/4)
- No memory reordering
- LSQ still needed for forwarded data (last slide)
- Easy to schedule
[Figure: in-order LSQ scheduling. Labels: Ready!, bid, grant, “head” pointer]
Fairly simple, but low performance

10. Loads OoO between Stores (Policy 2/4)
- Loads exec OoO w.r.t. each other
- Stores block everything
[Figure: LSQ entries (S, L, L, S, L) with “ready” and “issued” bits. Labels: S=0, L=1, “head” pointer]
Still simple, but better performance

11. Stores Can be Split into STA/STD
- STA: STore Address
- STD: STore Data
- Makes some designs easier
  - RS/ROB store one value
  - Stores need two (A & D)
[Figure: at dispatch/alloc a Store splits into STA and STD, a Load becomes LD. Labels: “store”, “load”, LSQ, RS, schedule, Add, Load]

12. Loads Wait for STAs Only (Policy 3/4)
- Only address is needed to disambiguate
- May be ready earlier to allow checking for violations
- No need to wait for data
[Figure: LSQ entries (S, L) with “Address ready” and “Data ready” bits]
Still simple, even better performance

13. Loads Execute When Ready (Policy 4/4)
- Most aggressive approach
- Relies on fact that store->load forwarding is rare
- Greatest potential IPC: loads never stall
- Potential for incorrect execution
  - Need to be able to “undo” bad loads
Very complex, but high performance
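The four policies differ only in what a load must wait for before it may issue. The predicate below is a simplified sketch: `Entry`, its fields, and `load_may_issue` are hypothetical names, the queue is an oldest-first list, and the case where an older store's known address actually matches the load (which requires forwarding or waiting for data) is deliberately left out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Entry:
    is_store: bool
    addr_ready: bool = False  # for a store: STA has executed
    data_ready: bool = False  # for a store: STD has executed
    issued: bool = False

def load_may_issue(policy: int, queue: List[Entry], i: int) -> bool:
    """May the load at queue[i] issue, given the entries older than it?"""
    older = queue[:i]
    if policy == 1:  # in-order memory: every older access has issued
        return all(e.issued for e in older)
    if policy == 2:  # loads OoO between stores: every older store has issued
        return all(e.issued for e in older if e.is_store)
    if policy == 3:  # wait for STAs only: older store addresses are known
        return all(e.addr_ready for e in older if e.is_store)
    if policy == 4:  # execute when ready: never wait, recover on violation
        return True
    raise ValueError("policy must be 1, 2, 3 or 4")
```

Each step down the list accepts a strict superset of the situations accepted by the previous policy, which is where the extra performance comes from; only policy 4 can issue a load that later turns out to be wrong.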

14. Detecting Ordering Violations (1/2)
- Case 1: Older store execs before younger load
  - No problem; if same address, st->ld forwarding happens
- Case 2: Older store execs after younger load
  - Store scans all younger loads
  - Address match -> ordering violation

15. Detecting Ordering Violations (2/2)
LSQ contents (oldest at top):
L/S  PC      Seq    Addr    Value
L    0xF048  41773  0x3290  42
S    0xF04C  41774  0x3410  25
S    0xF054  41775  0x3290  -17
L    0xF060  41776  0x3418  1234
L    0xF840  41777  0x3290  -17
L    0xF858  41778  0x3300  1
S    0xF85C  41779  0x3290  0
L    0xF870  41780  0x3410  25
L    0xF628  41781  0x3290  42
L    0xF63C  41782  0x3300  1
- Store broadcasts value, address and sequence #: (-17, 0x3290, 41775)
- Loads CAM-match on address, only care if store seq-# is lower than own seq
  - (Load 41773 ignores broadcast because it has a lower seq #)
- IF younger load hadn’t executed, and address matches, grab broadcasted value
- IF younger load has executed, and address matches, then ordering violation!
- Second broadcast on the slide: (0, 0x3290, 41779)
- An instruction may be involved in more than one ordering violation
- Must flush all later accesses after violation
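The broadcast check can be sketched in a few lines. This follows the slide's simplified rule literally (any younger, already-executed load with a matching address is flagged); the names `LoadEntry` and `store_broadcast` are assumptions, and real hardware performs the address comparison in parallel with a CAM rather than in a loop.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoadEntry:
    seq: int
    addr: int
    executed: bool
    value: Optional[int] = None

def store_broadcast(loads: List[LoadEntry], value: int, addr: int,
                    seq: int) -> List[int]:
    """Apply one store broadcast; return seq numbers of violating loads."""
    violations = []
    for ld in loads:
        if ld.addr != addr or ld.seq < seq:
            continue                   # no address match, or load is older
        if ld.executed:
            violations.append(ld.seq)  # already ran with a stale value
        else:
            ld.value = value           # not yet executed: grab the value
    return violations

# The three loads to 0x3290 from the slide's table.
loads = [
    LoadEntry(41773, 0x3290, executed=True, value=42),
    LoadEntry(41777, 0x3290, executed=False),
    LoadEntry(41781, 0x3290, executed=True, value=42),
]
first = store_broadcast(loads, -17, 0x3290, 41775)
second = store_broadcast(loads, 0, 0x3290, 41779)
```

Load 41781 is reported by both broadcasts, which matches the slide's remark that one instruction may be involved in more than one ordering violation.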

16. Dealing with Misspeculations
- Loads are not the only things that are wrong
  - Loads propagate wrong values to all dependents
- These must somehow be re-executed
- Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch
  - Load uses forwarded value
  - Correct value propagated when instructions re-execute

17. Flushing Complications
- Exactly the same as mispredicted branches
  - Checkpoint at every load in addition to branches
    - Very large number of checkpoints needed
  - Rollback to previous branch (which has its own checkpoint)
    - Make sure load doesn’t misspeculate on 2nd try
    - Must redo work between the branch and the load
  - Can work with undo-list style of recovery
- Not all younger insns. are dependent on bad load
- Pipeline latency due to refetch is exposed

18. Selective Re-Execution
- Re-execute only the dependent insns.
- Ideal case w.r.t. maintaining high IPC
  - No need to re-fetch/re-dispatch/re-rename/re-execute
- Very complicated
  - Need to hunt down only data-dependent insns.
  - Some bad insns. already executed (now in ROB)
  - Some bad insns. didn’t execute yet (still in RS)
- P4 does something like this (called “replay”)
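To see how much work selective re-execution can save compared with a full flush, the sketch below computes both instruction sets for a short sequence. It is an illustration under assumptions: instructions are hypothetical `(seq, dest, srcs)` tuples with architectural register names, only register dependences are tracked (dependences through memory are ignored), and the function names are invented here.

```python
def flush_set(insns, bad_seq):
    """Full flush: the bad load and everything younger is squashed."""
    return [seq for seq, _dest, _srcs in insns if seq >= bad_seq]

def dependent_set(insns, bad_seq):
    """Selective re-execution: only insns. that consume the bad value."""
    tainted = set()  # registers currently holding a wrong value
    bad = []
    for seq, dest, srcs in insns:     # insns must be in program order
        if seq < bad_seq:
            continue
        if seq == bad_seq:
            tainted.add(dest)         # the misspeculated load itself
        elif tainted & set(srcs):
            bad.append(seq)           # reads a wrong value
            if dest is not None:
                tainted.add(dest)     # and passes it on
        elif dest is not None:
            tainted.discard(dest)     # overwritten with a clean value
    return bad

# Hypothetical sequence; insn 1 is the misspeculated load into R8.
insns = [
    (1, "R8", ["R1"]),
    (2, "R9", ["R8", "R2"]),
    (3, "R5", ["R6"]),
    (4, "R10", ["R9"]),
    (5, "R8", ["R6"]),
    (6, "R11", ["R8"]),
]
```

Here a full flush discards all six instructions, while only instructions 2 and 4 actually consumed the wrong value. Finding that smaller set in hardware is what makes the approach complicated.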

19. LSQ Hardware in More Detail
- Very complicated CAM logic
  - Need to quickly look up based on value
  - May find multiple values / need age-based search
- No need for age-based search in ROB
  - Physical regs. are renamed, guarantees one writer
  - No easy way to prevent multiple stores to same address

20. Loads Checking for Earlier Stores
- On Load dispatch, find data from earlier Store
[Figure: LSQ with Address Bank and Data Bank. Entries: ST 0x4000, ST 0x4000, ST 0x4120, LD 0x4000. Each entry has an “=” comparator; signals: “Addr match”, “Valid store”, “No earlier matches”, “Use this store”]
- Need to adjust this so that load need not be at bottom, and LSQ can wrap around
- If |LSQ| is large, logic can be adapted to have log delay
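The age-based search with wrap-around can be sketched as a walk over a circular buffer, starting at the load and moving toward the oldest entry. This is a sequential software analogue, not the parallel comparator and priority logic in the figure; the function name, the dictionary layout of an entry, and the `head` index are assumptions.

```python
def find_earlier_store(lsq, head, load_idx):
    """Return the index of the youngest older store matching the load.

    lsq      : circular list; each slot is None or a dict with keys
               "is_store", "addr", "value"
    head     : index of the oldest valid entry
    load_idx : index of the load doing the lookup
    """
    n = len(lsq)
    addr = lsq[load_idx]["addr"]
    i = load_idx
    while i != head:
        i = (i - 1) % n  # step to the next-older slot, wrapping around
        entry = lsq[i]
        if entry is not None and entry["is_store"] and entry["addr"] == addr:
            return i     # first hit while walking back is the youngest
    return None          # no earlier match: read the cache instead

# The figure's entries, placed so that the queue wraps past the end.
lsq = [
    {"is_store": True, "addr": 0x4120, "value": 3},   # index 0
    {"is_store": False, "addr": 0x4000, "value": None},  # index 1: the load
    None,
    None,
    {"is_store": True, "addr": 0x4000, "value": 1},   # index 4: oldest
    {"is_store": True, "addr": 0x4000, "value": 2},   # index 5
]
hit = find_earlier_store(lsq, head=4, load_idx=1)
```

Two stores match address 0x4000, and the walk selects the one at index 5 because it is the younger of the two. This is the "multiple matches" case that forces an age-based search.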

21. Data Forwarding
- On execute Store (STA+STD), check for later Loads
- Similar logic to previous slide
[Figure: LSQ with Data Bank. Entries: ST 0x4000, ST 0x4120, LD 0x4000, ST 0x4000. Signals: “Addr Match”, “Is Load”, “Capture Value”, “Overwritten”, “Overwritten”]
This is ugly, complicated, slow, and power hungry

22. Alternative Data Forwarding: Store Colors
- Each store assigned unique number (its color)
- Loads inherit the color of the most recent store
[Figure: sequence of St and Ld entries labeled Color=1 through Color=4. Note on a group of three loads: “All three loads have same color: only care about ordering w.r.t. stores, not other loads”]
- Ignore store broadcasts if store’s color > your own
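Color assignment and the broadcast filter fit in two short functions. This is a sketch under assumptions: the function names are invented, colors start at 1 for the first store, and a load that precedes every store gets color 0.

```python
def assign_colors(kinds):
    """kinds: 'St' or 'Ld' in program order. Returns one color per insn."""
    color = 0
    colors = []
    for kind in kinds:
        if kind == "St":
            color += 1        # each store gets a new, unique color
        colors.append(color)  # a load inherits the most recent store's color
    return colors

def load_listens(load_color, store_color):
    """A load ignores broadcasts from stores whose color exceeds its own."""
    return store_color <= load_color

colors = assign_colors(["St", "Ld", "Ld", "Ld", "St", "Ld"])
# colors == [1, 1, 1, 1, 2, 2]
```

The three loads after the first store share color 1, so they are ordered only with respect to stores, not with respect to each other, and none of them reacts to a broadcast from the store with color 2.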

23. Split Load Queue/Store Queue
- Stores don’t need to broadcast address to stores
- Loads don’t need to check against earlier loads
[Figure: separate Store Queue (STQ) and Load Queue (LDQ)]
- Associative search for earlier stores only needs to check entries that actually contain stores
- Associative search for later loads for ST->LD forwarding only needs to check entries that actually contain loads