Presentation Transcript

Slide1

Revolver: Processor Architecture for Power Efficient Loop Execution

Mitchell Hayenga, Vignyan Reddy and Mikko H. Lipasti

Padmini Gaur (13IS15F)
Sanchi (13IS20F)

Slide2

Contents

The Need
Approaches and Issues
Revolver: Some basics
Loop Handling
Loop Detection
Detection and Training
Finite State Machine
Loop Execution
Scheduler: Units
Tag Propagation Unit
Loop Pre-execution
Conclusion
References

Slide3

The Need

Diminishing per-transistor energy improvements
Need for increasing computational efficiency
Power-efficient mobile and server processors
Increasing energy constraints
Elimination of unnecessary pipeline activity
Managing energy utilization
Instruction execution itself requires little energy, but control overheads are large

Slide4

So far: Approaches and Issues

Pipeline-centric instruction caching
Emphasizing temporal instruction locality
Capturing loop instructions in a buffer
Inexpensive retrieval for future iterations
Out-of-order processors: remaining issues
Resource allocation
Program ordering
Operand dependency

Slide5

Instructions serviced by Loop Buffer

Slide6

Energy Consumption

[Power Efficient Loop Execution Techniques: Mitchell Bryan Hayenga]

Slide7
Slide8

Revolver: An enhanced approach

Out-of-order back-end
Overall design similar to a conventional processor
Non-loop instructions follow the normal pipeline
No Register Alias Table (RAT) at the front-end; instead, a Tag Propagation Unit at the back-end
Loop mode: detection and dispatch of loops to the back-end

Slide9

The promises

No additional resource allocation per loop iteration
Front-end energy consumption reduced
Pre-execution of future iterations
Operand dependence linking moved to the back-end

Slide10

Loop handling

Loop detection
Training feedback
Loop execution
Wakeup logic
Tag Propagation Unit
Load pre-execution

Slide11

Loop Detection

Detection stages:
Post-execution
At decode
Enabling loop mode at decode
Calculation of:
Start address
Required resources

Slide12

Detection and Training

Key mechanisms:
Detection logic at the front-end -> loops dispatched
Feedback from the back-end on loop profitability
Unprofitable loops: future loop mode disabled
Detection controlled by the Loop Detection Finite State Machine

Slide13

FSM

Slide14

FSM states

Idle: remains through decode until a valid/profitable loop or a PC-relative backward branch/jump is detected
Profitability is logged in the Loop Address Table (LAT)
LAT records loop composition and profitability
Known profitable loop -> dispatched in loop mode
Backward jump/branch with no known loop -> Train state

Slide15

Train state records:
Start address
End address
Allowable unroll factor
Required resources, added to the LAT
Loop ends -> return to Idle state
In the Dispatch state, the decode logic guides the dispatch of loop instructions into the out-of-order back-end.
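Read together, these slides describe a three-state detector (Idle, Train, Dispatch). The minimal Python sketch below models those transitions; the event names and LAT fields are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of the loop-detection FSM described above.
# States and transitions follow the slides; event/field names are illustrative.

IDLE, TRAIN, DISPATCH = "idle", "train", "dispatch"

class LoopDetector:
    def __init__(self):
        self.state = IDLE
        self.lat = {}  # Loop Address Table: start PC -> loop info

    def on_backward_branch(self, start_pc, end_pc):
        entry = self.lat.get(start_pc)
        if entry and entry["profitable"]:
            self.state = DISPATCH          # known profitable loop: dispatch in loop mode
        elif entry is None:
            self.state = TRAIN             # unknown loop: record its composition
            self.lat[start_pc] = {"start": start_pc, "end": end_pc,
                                  "unroll": 0, "resources": 0,
                                  "profitable": False}

    def on_loop_exit(self):
        self.state = IDLE                  # loop ended: back to Idle
```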

Slide16

Disabling loop mode on:
System calls
Memory barriers
Load-linked/store-conditional pairs

Slide17

Training Feedback

Profitability: a 4-bit counter per loop
Default value = 8
Loop mode enabled if value >= 8
If a dispatched loop unrolls more than twice: +2
Otherwise: -2
On a misprediction other than the fall-through path: profitability set to 0
Disabled loops: the front-end increments the counter by 1 for every 2 sequential successful dispatches
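The update rules above translate directly into a small saturating counter. A Python sketch follows, assuming the 4-bit counter saturates at 0 and 15; the method names are invented for illustration.

```python
class Profitability:
    """Sketch of the per-loop 4-bit profitability counter (range 0..15)."""
    def __init__(self):
        self.value = 8                 # default value
        self.successes = 0             # sequential successful dispatches while disabled

    def enabled(self):
        return self.value >= 8         # loop mode enabled at or above threshold

    def on_dispatch(self, unroll_count):
        delta = 2 if unroll_count > 2 else -2
        self.value = max(0, min(15, self.value + delta))

    def on_misprediction(self, fall_through):
        if not fall_through:
            self.value = 0             # any exit other than fall-through zeroes it

    def on_disabled_success(self):
        self.successes += 1
        if self.successes == 2:        # +1 per two sequential successful dispatches
            self.value = min(15, self.value + 1)
            self.successes = 0
```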

Slide18

Loop: Basic idea

Unroll the loop as much as the back-end resources allow
Eliminate additional resource use after dispatch
Loop instructions stay in the issue queue, executing each iteration until the loop completes
Allocated resources are maintained across multiple executions
Load-store queues are modified to maintain program order

Slide19

Contd..

Proper access to destination and source registers
Loop exit:
Instructions removed from the back-end
Fall-through path dispatched

Slide20

Loop execution: Let’s follow

Green: odd-numbered iterations
Blue: even-numbered iterations
Pointers for program-order maintenance: loop_start, loop_end
Commit pointer tracks the oldest uncommitted entry

Slide21

Loop execution, contd..

Committing proceeds from start to end, then wraps to the start for the next loop iteration
Issue queue entries are reset for the next iteration
Load queue entries are invalidated
Store queue entries:
Passed to the write buffer
Immediately reused in the next iteration
If the write buffer cannot accept the store -> stall (very rare)
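One way to picture the wrap-around commit is a circular walk over the issue-queue segment holding the loop body. A hedged sketch, with the entry methods invented for illustration:

```python
# Sketch: commit pointer walking a circular issue-queue segment.
# 'entries' holds the dispatched loop body between loop_start and loop_end.

def commit_loop(entries, loop_start, loop_end, iterations):
    ptr = loop_start
    for _ in range(iterations):
        while True:
            entries[ptr].commit()          # retire oldest uncommitted entry
            entries[ptr].reset()           # reset for reuse next iteration
            if ptr == loop_end:
                break
            ptr += 1
        ptr = loop_start                   # wrap to start: next iteration
```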

Slide22

Scheduler: Units

Wake-up array: identifies ready instructions
Select logic: arbitrates among ready instructions
Instruction silo: produces the opcode and physical register identifiers of the selected instruction

Slide23

Scheduler: The design

Slide24

Scheduler: The concept

Managed as a queue: maintains program order among entries
Wakeup array: uses logical register identifiers, with position dependence
Tag Propagation Unit (TPU): performs physical register mapping

Slide25

Wakeup Logic: Overview

Observes generated results, identifying newly executable instructions
Program-order based
Broadcasts logical register identifiers
No renaming needed; no physical register identifiers in use

Slide26

Wakeup: The design

Slide27

Wake up array

Rows: instructions
Columns: logical registers
Signals: request, granted, ready

Slide28

Wakeup operation

On allocation into the wakeup array, logical source and destination registers are marked
An unscheduled instruction deasserts its destination register column downstream, preventing younger, dependent instructions from waking up
A request is sent when all necessary source register broadcasts have been received, i.e. all source registers are ready

Slide29

Select grants the request:

Downstream ready is asserted, waking up younger dependent instructions
Each wakeup logic cell holds 2 state bits: sourcing / producing the logical register
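To make the position-dependent wakeup concrete: rows are instructions in program order, columns are logical registers, and a grant re-asserts ready down a column until the next producer of that register. The Python model below is a sketch under those assumptions; the cell layout and method names are illustrative.

```python
# Sketch of the position-dependent wakeup array.
# rows = instructions in program order, cols = logical registers.
# Each cell stores two state bits: is_source, is_producer.

class WakeupArray:
    def __init__(self, num_rows, num_regs):
        self.src = [[False] * num_regs for _ in range(num_rows)]
        self.dst = [[False] * num_regs for _ in range(num_rows)]
        self.ready = [[True] * num_regs for _ in range(num_rows)]

    def allocate(self, row, sources, dests):
        for r in sources:
            self.src[row][r] = True
        for r in dests:
            self.dst[row][r] = True
            # An unscheduled producer deasserts its column downstream,
            # keeping younger dependents from waking up.
            for below in range(row + 1, len(self.ready)):
                self.ready[below][r] = False

    def requests(self):
        # A row requests issue once every source column it reads is ready.
        return [row for row in range(len(self.src))
                if all(self.ready[row][r]
                       for r in range(len(self.src[row])) if self.src[row][r])]

    def grant(self, row):
        # Select grants: assert ready downstream for the produced registers,
        # up to (not past) the next producer of the same logical register.
        for r in range(len(self.dst[row])):
            if self.dst[row][r]:
                for below in range(row + 1, len(self.ready)):
                    self.ready[below][r] = True
                    if self.dst[below][r]:
                        break
```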

Slide30

The simple logic

Slide31

An example with dependence

Slide32

Tag Propagation Unit (TPU)

No renaming!
Maps physical register identifiers to logical registers
Enables reuse of physical registers, since no additional resources are allocated
Manages physical registers
Enables speculative execution of the next loop iteration

Slide33

Next loop iteration??

Impossible if an instruction has access to only a single physical destination register
Speculative execution needs an alternative physical register identifier
Solution: 2 physical destination registers, alternating between the two

Slide34

With 2 destination registers

Double buffering: maintains the previous state during speculative computation
Once iteration N+1 commits, iteration N+2 reuses the destination register of iteration N
No instruction depends on both iterations N and N+2, so speculative writing to the output register is allowed
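A minimal sketch of the two-destination scheme: each loop-resident instruction alternates between its two physical destination registers by iteration parity, so iteration N's value survives while N+1 executes speculatively. The parity-based selection and the names are assumptions for illustration.

```python
# Sketch: double-buffered physical destinations per loop instruction.
# Even iterations write pd0, odd iterations write pd1, so the previous
# iteration's value stays intact while the next one runs speculatively.

class LoopInstruction:
    def __init__(self, pd0, pd1):
        self.pdests = (pd0, pd1)       # two physical destination registers

    def dest_for(self, iteration):
        return self.pdests[iteration % 2]

inst = LoopInstruction(pd0=12, pd1=13)
assert inst.dest_for(0) == 12          # iteration N writes pd0
assert inst.dest_for(1) == 13          # iteration N+1 writes pd1
assert inst.dest_for(2) == 12          # N+2 reuses pd0 after N+1 commits
```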

Slide35

With double buffering

Dependent instructions are dynamically linked to their source registers
Changing the logical register mapping overwrites the output register column
Because instructions are stored in program order, downstream instructions obtain the proper source mapping

Slide36

Source, destination and iteration

Slide37

Register reclamation

On any instruction misprediction:
Downstream instructions are flushed
Mappings are propagated to all newly scheduled instructions
Better than a RAT: reduced complexity

Slide38

Queue entries: Lifetime

Received prior to dispatch
Retained until the instruction exits the back-end
Reused to execute multiple loop iterations
LSQ entries freed immediately upon commit
Position-based age logic in the LSQ
Load queue entries: simply reset for future use

Slide39

Store Queue entries: An extra effort

Need to write back
Drained immediately into a write buffer between the store queue and the L1 cache
If the buffer cannot accept the write -> stall (very rare)
Allows the commit pointer to wrap around

Slide40

Loop pre-execution

Pre-execution of future loads:
Parallelization
Enables zero-latency loads (no L1 cache access latency)
Loads execute repeatedly until all iterations complete
Exploits the recurrent nature of loops: highly predictable address patterns

Slide41

Learning from example: String copy

Copying a source array to a destination array
Predictable load addresses: consecutive bytes accessed from memory
Primary addressing patterns:
Stride
Constant
Pointer-based
Simple pattern-identification hardware is placed alongside the pre-executed load buffers

Slide42

Stride based addressing

Most common: iterating over a data array
Compute the address Δ between 2 consecutive loads
If the third load matches the predicted stride, the stride is verified and the next load is pre-executed
Constant: a special case of zero-sized stride
Reads from the same address each iteration
Arises from stack-allocated variables / pointer aliasing
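The detect-then-verify rule is easy to state in code. Below is a hedged Python sketch of a per-load stride predictor (field names invented): the first two loads establish a candidate stride, the third verifies it, after which the next address can be pre-executed. Fed consecutive byte addresses, as in the string-copy example, it locks onto a stride of 1; a repeated address is just the constant, zero-stride case.

```python
# Sketch: per-load stride detection for pre-execution.
# Two loads establish a candidate stride; the third verifies it.

class StridePredictor:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.verified = False

    def observe(self, addr):
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.stride is None:
                self.stride = delta            # candidate stride from loads 1 and 2
            elif delta == self.stride:
                self.verified = True           # third load matches: verified
            else:
                self.stride, self.verified = delta, False
        self.last_addr = addr

    def predict_next(self):
        # Address to pre-execute; constant loads are just stride == 0.
        if self.verified:
            return self.last_addr + self.stride
        return None

p = StridePredictor()
for a in (0x1000, 0x1001, 0x1002):             # string copy: consecutive bytes
    p.observe(a)
assert p.predict_next() == 0x1003
```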

Slide43

Pointer based addressing

The value returned by the current load becomes the next address
E.g. linked-list traversals
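The pointer case needs no address arithmetic: if the value returned by the previous load keeps matching the next load's address, predict the freshly loaded value as the next address. A minimal sketch under that assumption:

```python
# Sketch: pointer-based prediction. If the previous load's returned value
# matches this load's address (linked-list style), predict the loaded
# value as the next address to pre-execute.

class PointerPredictor:
    def __init__(self):
        self.last_value = None
        self.confident = False

    def observe(self, addr, value):
        self.confident = (addr == self.last_value)  # did the pattern hold?
        self.last_value = value

    def predict_next(self):
        return self.last_value if self.confident else None
```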

Slide44

Pre-execution: more..

Pre-executed load buffer placed between the load queue and the L1 cache interface
A store that clashes with a pre-executed load invalidates the entry (coherency maintenance)
Pre-executed loads speculatively wake up dependent operations on the next cycle
On an incorrect address prediction, the scheduler cancels and re-issues the operation
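A small sketch of how such a buffer could behave, with the interface invented for illustration: stores snoop and invalidate matching entries for coherency, and a miss (or an invalidated entry) means the speculatively woken dependents must be cancelled and re-issued.

```python
# Sketch: pre-executed load buffer sitting between the load queue and L1.
# A clashing store invalidates the entry; a wrong predicted address
# forces the scheduler to cancel and re-issue dependents.

class PreExecutedLoadBuffer:
    def __init__(self):
        self.entries = {}                  # addr -> pre-loaded value

    def fill(self, addr, value):
        self.entries[addr] = value         # pre-executed load result

    def snoop_store(self, addr):
        self.entries.pop(addr, None)       # store clash: invalidate for coherency

    def lookup(self, addr):
        # Hit: zero-latency load. Miss (or invalidated): fall back to L1,
        # and the scheduler cancels/re-issues speculatively woken dependents.
        return self.entries.get(addr)
```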

Slide45

Conclusion

Minimizes energy during loop execution
Eliminates front-end overheads from pipeline activity and resource allocation
Achieves greater benefits than loop buffers and μop caches
Pre-execution improves performance during loop execution by hiding L1 cache latencies
Reported 5.3-18.3% energy-delay benefit

Slide46

References

J. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, M. Irwin. Scheduling Reusable Instructions for Power Reduction, 2004.
P. G. Sassone, J. Rupley, E. Breckelbaum, G. H. Loh, B. Black. Matrix Scheduler Reloaded.
L. H. Lee, B. Moyer, J. Arends. Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops.
M. B. Hayenga. Power Efficient Loop Execution Techniques.