Revolver: Processor Architecture for Power Efficient Loop Execution

Mitchell Hayenga, Vignayan Reddy and Mikko H. Lipasti

Presented by:
Padmini Gaur (13IS15F)
Sanchi (13IS20F)

Contents
- The Need
- Approaches and Issues
- Revolver: Some basics
- Loop Handling
- Loop Detection
- Detection and Training
- Finite State Machine
- Loop Execution
- Scheduler Units
- Tag Propagation Unit
- Loop pre-execution
- Conclusion
- References

The Need
- Diminishing per-transistor energy improvements
- Increasing computational efficiency demands
- Power-efficient mobile and server processors
- Increasing energy constraints
- Elimination of unnecessary pipeline activity
- Managing energy utilization
- Instruction execution itself needs little energy, but control overheads are large

So far: Approaches and Issues
- Pipeline-centric instruction caching:
  - Emphasizes temporal instruction locality
  - Captures loop instructions in a buffer
  - Inexpensive retrieval on future iterations
- Issues for out-of-order processors:
  - Resource allocation
  - Program ordering
  - Operand dependency

Instructions serviced by Loop Buffer

Energy Consumption
[Power Efficient Loop Execution Techniques: Mitchell Bryan Hayenga]

Revolver: An enhanced approach
- Out-of-order back-end; overall design similar to a conventional processor
- Non-loop instructions follow the normal pipeline
- No Register Alias Table at the front-end; instead, a Tag Propagation Unit at the back-end
- Loop mode: loops are detected and dispatched to the back-end

The promises
- No additional resource allocation per loop iteration
- Front-end energy consumption is managed
- Pre-execution of future iterations
- Operand dependence linking moved to the back-end

Loop handling
- Loop detection
- Training feedback
- Loop execution
- Wakeup logic
- Tag Propagation Unit
- Load pre-execution

Loop Detection
- Detection happens at two stages:
  - Post-execution
  - At the decode stage
- Loop mode is enabled at decode
- Calculation of:
  - Start address
  - Required resources

Detection and Training
- Key mechanisms:
  - Detection logic at the front-end; detected loops are dispatched
  - Feedback from the back-end on loop profitability
- Profitability controls:
  - Disabling of future loop mode
  - Detection control
- Loop Detection Finite State Machine

FSM

FSM states
- Idle: stays through decode until a valid/profitable loop or a PC-relative backward branch/jump is detected
- Profitability is logged in the Loop Address Table (LAT)
- The LAT records loop composition and profitability
- A known profitable loop is dispatched
- Backward jump/branch with no known loop → Train state

Train
- The Train state records:
  - Start address
  - End address
  - Allowable unroll factor
- Required resources are added to the LAT
- Loop ends → back to the Idle state
- In the Dispatch state, the decode logic guides the dispatch of loop instructions into the out-of-order back-end

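The Idle → Train → Dispatch flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class and method names are invented, the LAT is modeled as a simple dict keyed by loop start address, and the real hardware FSM observes more inputs than a single backward-branch event.

```python
# Sketch of the loop-detection FSM (Idle -> Train -> Dispatch).
# All names here are illustrative, not the paper's exact signals.
IDLE, TRAIN, DISPATCH = "idle", "train", "dispatch"

class LoopDetector:
    def __init__(self):
        self.state = IDLE
        self.lat = {}          # Loop Address Table: start_pc -> profitable?
        self.start_pc = None

    def on_backward_branch(self, target_pc):
        if self.state == IDLE:
            if self.lat.get(target_pc, False):
                # Known profitable loop: dispatch it to the back-end.
                self.state = DISPATCH
                self.start_pc = target_pc
            else:
                # Unknown loop: record its start and begin training.
                self.state = TRAIN
                self.start_pc = target_pc
        elif self.state == TRAIN and target_pc == self.start_pc:
            # One full trip observed: log it in the LAT, return to Idle.
            self.lat[self.start_pc] = True
            self.state = IDLE

    def on_loop_exit(self):
        # Fall-through path: leave loop handling in any state.
        self.state = IDLE
```

On the second encounter of the same backward branch the loop is in the LAT, so the third encounter dispatches it directly.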
Disabling loop mode
Loop mode is disabled on:
- System calls
- Memory barriers
- Load-linked/store-conditional pairs

Training Feedback
- Profitability: a 4-bit counter, default value 8
- Loop mode is enabled while the value is ≥ 8
- If a dispatched loop unrolls more than twice: +2
- Else: −2
- On any misprediction other than the fall-through exit, profitability is set to 0
- Disabled loops: the front-end increments the counter by 1 for every 2 sequential successful dispatches

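The profitability rules above fit naturally into a small saturating-counter model. This is a sketch, not the paper's RTL: the saturation bound of 15 follows from "4-bit counter", but the method names and the exact recovery bookkeeping are assumptions.

```python
class ProfitabilityCounter:
    """4-bit saturating counter gating loop mode (illustrative sketch)."""
    MAX = 15  # a 4-bit counter saturates at 15

    def __init__(self):
        self.value = 8        # default value: loop mode starts enabled
        self.successes = 0    # sequential successful dispatches while disabled

    def loop_mode_enabled(self):
        return self.value >= 8

    def on_dispatch(self, unroll_factor):
        # Loops that unroll more than twice earn +2, others lose 2.
        if unroll_factor > 2:
            self.value = min(self.MAX, self.value + 2)
        else:
            self.value = max(0, self.value - 2)

    def on_misprediction(self, fall_through):
        # Any misprediction other than the fall-through exit zeroes profitability.
        if not fall_through:
            self.value = 0
            self.successes = 0

    def on_successful_dispatch_while_disabled(self):
        # Disabled loops recover slowly: +1 per 2 sequential successful dispatches.
        self.successes += 1
        if self.successes == 2:
            self.value = min(self.MAX, self.value + 1)
            self.successes = 0
```

A loop that repeatedly unrolls well climbs toward saturation; a single bad misprediction drops it out of loop mode until it proves itself again.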
Loop execution: Basic idea
- Unroll the loop as far as back-end resources allow
- Eliminates additional resource allocation after dispatch
- Loop instructions stay in the issue queue and execute until all iterations complete
- Resources granted at dispatch are maintained across multiple executions
- Load/store queues are modified to maintain program order

Contd.
- Proper access to destination and source registers
- On loop exit:
  - Loop instructions are removed from the back-end
  - The loop fall-through path is dispatched

Loop execution: Let's follow
- Green: odd-numbered iterations; Blue: even-numbered iterations
- Program order is maintained with pointers: loop_start, loop_end
- The commit pointer tracks the oldest uncommitted entry

Loop execution, contd.
- Committing proceeds from start to end, then wraps back to start for the next loop iteration
- Issue queue entries are reset for the next loop iteration
- Load queue entries are invalidated
- Store queue entries:
  - Passed to the write buffer
  - Immediately reused in the next iteration
  - If the write buffer cannot accept them → stall (very rare)

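The commit-and-wrap behaviour can be illustrated with a tiny circular pointer over the loop's queue entries. The class below is an invented sketch: entry indices stand in for issue-queue slots, and the loop_start/loop_end names mirror the pointers mentioned above.

```python
class LoopCommitPointer:
    """Commit pointer that walks loop entries from start to end,
    wrapping back to start for the next iteration (sketch)."""

    def __init__(self, loop_start, loop_end):
        self.loop_start = loop_start   # index of the first loop entry
        self.loop_end = loop_end       # index of the last loop entry
        self.commit = loop_start       # oldest uncommitted entry
        self.iteration = 0             # completed loop iterations

    def commit_entry(self):
        committed = self.commit
        if self.commit == self.loop_end:
            # Wrap to start: commit the next iteration using the
            # very same queue entries, so nothing is reallocated.
            self.commit = self.loop_start
            self.iteration += 1
        else:
            self.commit += 1
        return committed
```

Because the pointer cycles over a fixed window of entries, no front-end allocation is needed between iterations.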
Scheduler: Units
- Wakeup array: identifies ready instructions
- Select logic: arbitrates among ready instructions
- Instruction silo: produces the opcode and physical register identifiers of the selected instruction

Scheduler: The design

Scheduler: The concept
- Managed as a queue; maintains program order among entries
- Wakeup array:
  - Uses logical register identifiers
  - Position-dependent
- Tag Propagation Unit (TPU): provides the physical register mapping

Wakeup Logic: Overview
- Observes generated results to identify newly executable instructions
- Program-based ordering
- Broadcasts logical register identifiers
- No renaming needed; no physical register identifiers are used

Wakeup: The design

Wakeup array
- Rows: instructions
- Columns: logical registers
- Signals: Request, Grant, Ready

Wakeup operation
- On allocation into the wakeup array, logical source and destination registers are marked
- An unscheduled instruction deasserts its downstream destination-register columns, preventing younger dependent instructions from waking up
- A request is sent once all necessary source register broadcasts have been received, i.e. all source registers are ready

When select grants a request:
- Downstream ready is asserted, waking up younger dependent instructions
- Each wakeup logic cell holds 2 state bits: whether the instruction sources/produces that logical register

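The request/grant/ready interplay above can be modeled as a small bit matrix: rows are instructions in program order, columns are logical registers, and a granted row asserts ready on its destination columns for younger rows. This Python sketch is illustrative only; the data-structure names are assumptions and the real array is combinational logic, not a per-cycle loop.

```python
class WakeupArray:
    """Position-dependent wakeup array sketch: rows = instructions in
    program order, columns = logical registers."""

    def __init__(self, num_regs):
        self.rows = []                   # (source set, dest set) per instruction
        self.ready = [True] * num_regs   # ready bit per logical register column
        self.issued = []

    def allocate(self, sources, dests):
        # An unscheduled producer deasserts its destination columns so
        # younger dependent instructions cannot wake up early.
        for d in dests:
            self.ready[d] = False
        self.rows.append((set(sources), set(dests)))
        self.issued.append(False)

    def step(self):
        """One select cycle: grant every row whose sources are all ready,
        then assert ready on the granted rows' destination columns."""
        granted = [i for i, (srcs, _) in enumerate(self.rows)
                   if not self.issued[i] and all(self.ready[s] for s in srcs)]
        for i in granted:
            self.issued[i] = True
            for d in self.rows[i][1]:   # wake younger dependents next cycle
                self.ready[d] = True
        return granted
```

Because only logical register columns are broadcast, no physical register tags (and hence no renaming) are needed for wakeup.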
The simple logic

An example with dependence

Tag Propagation Unit (TPU)
- No renaming!
- Maps physical register identifiers to logical registers
- Enables reuse of physical registers, since no additional resources are allocated
- Manages physical registers
- Makes speculative execution of future loop iterations possible

Next loop iteration?
- Impossible if an instruction has only a single physical destination register
- Speculative execution needs an alternative physical register identifier
- Solution: 2 physical destination registers, alternating writes between the two

With 2 destination registers
- Double buffering: the previous state is maintained during speculative computation
- When iteration N+1 commits, the destination register of iteration N is reused by iteration N+2
- There is no instruction dependence between iterations N and N+2
- Speculative writing into the output register is therefore allowed

With double buffering
- Dynamic linkage between dependent instructions and their source registers
- Changing the logical register mapping overwrites the output register column
- Instructions are stored in program order, so downstream instructions obtain the proper source mapping

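The alternation between the two physical destination registers can be sketched as follows. The class and register numbers are invented for illustration; the point is simply that iterations N and N+2 share a register while N and N+1 never do, which is what makes speculative writes safe.

```python
class DoubleBufferedDest:
    """Each loop instruction holds two physical destination registers and
    alternates between them every iteration, so iteration N+1 can write
    speculatively while iteration N's result is still live (sketch)."""

    def __init__(self, phys_a, phys_b):
        self.regs = (phys_a, phys_b)
        self.iteration = 0

    def dest_for(self, iteration):
        # Even iterations use one register, odd iterations the other:
        # N and N+2 reuse a register, N and N+1 never collide.
        return self.regs[iteration % 2]

    def next_dest(self):
        reg = self.dest_for(self.iteration)
        self.iteration += 1
        return reg
```

Once iteration N+1 commits, the register last written by iteration N is dead and can be overwritten by iteration N+2 without any reclamation logic.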
Source, destination and iteration

Register reclamation
- On any misprediction, downstream instructions are flushed
- Mappings are propagated to all newly scheduled instructions
- Simpler than a RAT: complexity is reduced

Queue entries: Lifetime
- Received prior to dispatch; retained until the instruction exits the back-end
- Reused to execute multiple loop iterations
- LSQ entries are freed immediately upon commit
- Position-based age logic in the LSQ
- Load queue entries are simply reset for future iterations

Store queue entries: An extra effort
- Need to write back to memory
- Drained immediately into a write buffer placed between the store queue and the L1 cache
- If a store cannot be written → stall (very rare)
- The commit pointer wraps around

Loop pre-execution
- Pre-execution of future loads:
  - Parallelization
  - Enables zero-latency loads (no L1 cache access latency)
- A load is executed repeatedly until all iterations complete
- Exploits the recurrent nature of loops: highly predictable address patterns

Learning from example: String copy
- Copies a source array to a destination array
- Predictable load addresses: consecutive bytes are accessed from memory
- Primary addressing patterns:
  - Stride
  - Constant
  - Pointer-based
- Simple pattern-identification hardware is placed alongside the pre-executed load buffers

Stride-based addressing
- The most common pattern: iterating over a data array
- Compute the address delta between 2 consecutive loads
- If a third load matches the predicted stride, the stride is verified
- Pre-execute the next load
- Constant: a special case with a stride of zero
  - Reads from the same address every iteration
  - E.g. stack-allocated variables / pointer aliasing

Pointer-based addressing
- The value returned by the current load becomes the next load's address
- E.g. linked-list traversals

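The three addressing patterns above (stride, its zero-stride constant special case, and pointer-based) can be sketched as one small predictor over recent load history. This is an illustrative model only: the three-load confirmation window and all names are assumptions, not the hardware's actual tables.

```python
class LoadAddressPredictor:
    """Classifies a loop load as pointer-based, strided, or constant
    from its last three dynamic instances, and predicts the next
    address for pre-execution (sketch)."""

    def __init__(self):
        self.history = []   # (address, loaded value) per dynamic load

    def observe(self, address, value):
        self.history.append((address, value))

    def predict_next(self):
        if len(self.history) < 3:
            return None     # not enough history to verify a pattern
        (a0, v0), (a1, v1), (a2, v2) = self.history[-3:]
        # Pointer pattern: each address is the previous load's value
        # (e.g. linked-list traversal), so the next address is v2.
        if a1 == v0 and a2 == v1:
            return v2
        # Stride pattern: delta verified across three consecutive loads;
        # a zero delta is the constant-address special case.
        if a1 - a0 == a2 - a1:
            return a2 + (a2 - a1)
        return None
```

Once a pattern is verified, the pre-executed load buffer can fetch the predicted address ahead of time; a store to that address or a mispredicted address simply invalidates the entry, as the next slide describes.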
Pre-execution: more
- The pre-executed load buffer is placed between the load queue and the L1 cache interface
- If a store clashes with a pre-executed load, the entry is invalidated (maintains coherency)
- Pre-executed loads speculatively wake up dependent operations on the next cycle
- On an incorrect address prediction, the scheduler cancels and re-issues the operation

Conclusion
- Minimizes energy during loop execution
- Eliminates front-end overheads originating from pipeline activity and resource allocation
- Achieves greater benefits than loop buffers and μop caches
- Pre-execution improves performance during loop execution by hiding L1 cache latencies
- Reported energy-delay benefit: 5.3-18.3%

References
- J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, M. Irwin, "Scheduling Reusable Instructions for Power Reduction", 2004
- P. G. Sassone, J. Rupley, E. Breckelbaum, G. H. Loh, B. Black, "Matrix Scheduler Reloaded"
- L. H. Lee, B. Moyer, J. Arends, "Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops"
- Mitchell Bryan Hayenga, "Power Efficient Loop Execution Techniques"