Slide1
Octavo: An FPGA-Centric Processor Architecture
Charles Eric LaForest, J. Gregory Steffan
ECE, University of Toronto
FPGA 2012, February 24
Slide2
Easier FPGA Programming
We focus on overlay architectures
  Nios, MicroBlaze, Vector Processors
These inherited their architectures from ASICs
  Easy to use with existing software tools
  Performance penalty
ASIC architectures poor fit to FPGA hardware!
ASIC ≠ FPGA
  ASIC: transistors, poly, vias, metal layers
  FPGA: LUTs, BRAMs, DSP Blocks, routing
  Fixed widths, depths, other discretizations
FPGA-centric processor design?
Slide3
How do FPGAs Want to Compute?

Hardware (Stratix IV)   Width (bits)   Fmax (MHz)
DSP Blocks              36             480
Block RAMs              36             550
ALUTs                   1              800
Nios II/f               32             230

What processor architecture best fits the underlying FPGA?
Slide4
Research Goals
Assume threaded data parallelism
Run at maximum FPGA frequency
Have high performance
Never stall
Aim for simple, minimal ISA
Match architecture to underlying FPGA
Slide5
Result: Octavo
10 stages, 8 threads, 550 MHz
Family of designs
  Word width (8 to 72 bits)
  Memory depth (2 to 32k words)
  Pipeline depth (8 to 16 stages)
Snapshot of work-in-progress
Slide6
Designing Octavo
Slide7
High-Level View of Octavo
Unified registers and RAM
Slide8
Octavo vs. Classic RISC
All memories unified (no loads/stores)
How to pipeline Octavo?
Slide9
Design For Speed: Self-Loop Characterization
Slide10
Self-Loop Characterization
Connect module outputs to inputs
Accounts for the FPGA interconnect
Pipeline loop paths to absorb delays
Pointed to other limits than raw delay
  Minimum clock pulse widths
  DSP Blocks: 480 MHz
  BRAMs: 550 MHz
We measured some surprising delays…
Slide11
BRAM Self-Loop Characterization
[Figure: BRAM self-loop configurations, labelled 398 MHz (routing!), 656 MHz, 531 MHz, 710 MHz]
Must connect BRAMs using registers
Slide12
Building Octavo: Memory
Slide13
Building Octavo: Memory
Slide14
Memory
Replicated “scratchpad” memories with I/O
while still exceeding 550 MHz limit.
[Figure labels: Instruction, ALU Result]
Slide15
Building Octavo: ALU
Slide16
Building Octavo: ALU
Fully pipelined (4 stages)
Never stalls
Slide17
Building Octavo: ALU
Multiplication
  Uses DSP Blocks
  Must overcome their 480 MHz limit…
Slide18
Building Octavo: Multiplier
One multiplier is wide enough but too slow
Two multipliers working at half-speed
Send data to both multipliers in alternation
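The alternation scheme above can be sketched behaviourally in Python. This is a toy timing model, not the authors' hardware description: the function name `interleaved_multiply`, the result tuple layout, and the default `latency` of 4 fast-clock cycles are assumptions made for illustration. It shows that when operand pairs arrive every fast cycle and are steered to two multipliers in turn, each multiplier accepts a new input only every second cycle, yet one product is still produced per fast cycle.

```python
def interleaved_multiply(pairs, latency=4):
    """Toy model: issue one operand pair per fast-clock cycle to two multipliers.

    Even cycles feed multiplier 0 and odd cycles feed multiplier 1, so each
    multiplier runs at half the issue rate. `latency` is a hypothetical
    multiplier latency in fast-clock cycles (not taken from the slides).
    Returns a list of (completion_cycle, unit, product) tuples.
    """
    last_issue = [None, None]  # fast cycle of the most recent input per unit
    results = []
    for cycle, (a, b) in enumerate(pairs):
        unit = cycle % 2  # alternate between the two multipliers
        if last_issue[unit] is not None:
            # Each unit is given two fast cycles between consecutive inputs.
            assert cycle - last_issue[unit] == 2
        last_issue[unit] = cycle
        results.append((cycle + latency, unit, a * b))
    return results


out = interleaved_multiply([(2, 3), (4, 5), (6, 7), (8, 9)])
for done, unit, product in out:
    print(f"cycle {done}: multiplier {unit} -> {product}")
```

Results complete on consecutive fast cycles even though each unit is only used on alternate cycles, which is the point of running two slower multipliers side by side.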
[Figure labels: 480 MHz, 600 MHz]
Slide19
Octavo: Putting It All Together
Slide20
Octavo
[Figure: pipeline stages 0 to 9]
Pipeline
  10 stages
  Actually 8 stages with one exception (more later)
No result forwarding or pipeline interlocks
Scalar, Single-Issue, In-Order, Multi-Threaded
Slide21
Octavo
Instruction Memory
  Indexed by current thread PC
  Provides a 3-operand instruction
  On-chip BRAMs only
[Figure: pipeline stages 0 to 9; block I]
Slide22
Octavo
A and B Memories
  Receive operand addresses from instruction
  Provide data operands to ALU and Controller
  Some addresses map to I/O ports
  On-chip BRAMs only
[Figure: pipeline stages 0 to 9; blocks I, A/B, A/B]
Slide23
Octavo
Pipeline Registers
  Avoid an odd number of stages
  Separate BRAMs for best speed
  Predicted by BRAM self-loop characterization
  Unusual but essential design constraint
[Figure: pipeline stages 0 to 9; blocks I, A/B, A/B]
Slide24
Octavo
Controller
  Receives opcode, source/destination operands
  Decides branches
  Provides current PC of next thread to I memory
[Figure: pipeline stages 0 to 9; blocks CTL0, CTL1, I, A/B, A/B]
Slide25
Octavo
ALU
  Receives opcode and data
  Writes result to all memories
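A rough Python sketch may help picture the data flow described on the last few slides: a 3-operand instruction names two source addresses and a destination, the A and B memories supply the operands, and the ALU result is written back to every memory, so no load or store instructions are needed. Everything concrete here is hypothetical: the `Instr` fields, the `TinyMachine` class, the opcode names, and the 16-word, 36-bit defaults are illustrative choices and not Octavo's actual encoding or ISA.

```python
import operator
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical decoded 3-operand instruction (not Octavo's real encoding)."""
    op: str
    dest: int
    src_a: int
    src_b: int


# Illustrative opcode set; the slides do not list the actual ISA.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


class TinyMachine:
    def __init__(self, depth=16, width=36):
        self.mask = (1 << width) - 1  # wrap results to the word width
        # Three memories sharing one address space, as in the unified view.
        self.mem_i = [0] * depth
        self.mem_a = [0] * depth
        self.mem_b = [0] * depth

    def write_all(self, addr, value):
        """Write the same word to every memory."""
        for mem in (self.mem_i, self.mem_a, self.mem_b):
            mem[addr] = value & self.mask

    def execute(self, instr):
        a = self.mem_a[instr.src_a]  # A memory supplies the first operand
        b = self.mem_b[instr.src_b]  # B memory supplies the second operand
        result = OPS[instr.op](a, b) & self.mask
        self.write_all(instr.dest, result)
        return result


m = TinyMachine()
m.write_all(1, 5)
m.write_all(2, 7)
total = m.execute(Instr("ADD", dest=3, src_a=1, src_b=2))
print(total, m.mem_a[3], m.mem_b[3], m.mem_i[3])
```

Because every write lands in all three memories, a later instruction can read the result from either operand port without an explicit copy; the sketch ignores pipelining and I/O-mapped addresses entirely.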
[Figure: pipeline stages 0 to 9; blocks ALU0, ALU1, ALU2, ALU3, CTL0, CTL1, I, A/B, A/B]
Slide26
Octavo
[Figure: pipeline stages 0 to 9; blocks ALU0, ALU1, ALU2, ALU3, CTL0, CTL1, I, A/B, A/B]
Longest mandatory loop: 8 stages
Along A/B memories and ALU
Fill with 8 threads to avoid stalls
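The reasoning behind filling the 8-stage loop with 8 threads can be checked with a small Python model. It assumes strict round-robin issue, one instruction per cycle, and that a result becomes visible exactly `loop_stages` cycles after its instruction issues; the function name `count_hazards` and these timing assumptions are illustrative, not taken from the slides. The function counts issue slots in which a thread would need a result of its own previous instruction that is not yet written back.

```python
def count_hazards(threads=8, loop_stages=8, cycles=24):
    """Count issue slots where a thread's previous result is not yet visible.

    Assumes round-robin issue (thread = cycle mod threads) and that a result
    is visible `loop_stages` cycles after issue. A non-zero count means the
    pipeline would need stalls, forwarding, or interlocks.
    """
    last_issue = {}  # thread id -> cycle of its most recent instruction
    hazards = 0
    for cycle in range(cycles):
        thread = cycle % threads
        if thread in last_issue:
            result_ready = last_issue[thread] + loop_stages
            if cycle < result_ready:
                hazards += 1
        last_issue[thread] = cycle
    return hazards


print(count_hazards(threads=8, loop_stages=8))  # 0: every result arrives in time
print(count_hazards(threads=4, loop_stages=8))  # 20: too few threads for the loop
```

With as many threads as loop stages, each thread's next instruction issues just as its previous result lands, which is why the design can drop result forwarding and interlocks.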
[Figure labels: threads T6, T7, T2, T3, T4, T5, T0, T1]
Slide27
Octavo
Special case longest loop: 10 stages
  Along instruction memory and ALU
  Does not affect most computations
  Adds a delay slot to subroutine and loop code
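One way to see why a 10-stage loop costs a delay slot with 8 threads is the small calculation below. It is a simplified model, assuming strict round-robin issue so that a thread issues once every `threads` cycles; the function name `delay_slots` and the formula are an illustration under that assumption, not something stated on the slides.

```python
import math


def delay_slots(loop_stages, threads):
    """Instructions of the same thread that issue before a result is visible.

    A thread issues every `threads` cycles, so a result that takes
    `loop_stages` cycles is first visible to the instruction issued
    ceil(loop_stages / threads) turns later; earlier turns are delay slots.
    """
    return max(0, math.ceil(loop_stages / threads) - 1)


print(delay_slots(8, 8))   # 0: the 8-stage A/B-ALU loop is fully hidden
print(delay_slots(10, 8))  # 1: the 10-stage instruction-memory loop
```

Under this model the common 8-stage loop needs no delay slots, while a write that travels the 10-stage loop through instruction memory is only seen two turns later, leaving one delay slot.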
[Figure: pipeline stages 0 to 9; blocks ALU0, ALU1, ALU2, ALU3, CTL0, CTL1, I, A/B, A/B]
Slide28
Results: Speed and Area
Slide29
Experimental Framework
Quartus 10.1 targeting Stratix IV (fastest)
Optimize and place for speed
Average speed over 10 placement runs
Varied processor parameters:
  Word width
  Memory depth
  Pipeline depth
Measure Frequency, Area, and Density
Slide30
Maximum Operating Frequency
Slide31
Maximum Operating Frequency
[Figure annotations: Faster; Wider; BRAM hard limit; Timing Slack!]
Slide32
Maximum Operating Frequency
[Figure annotations: 550+ MHz, 36 bits wide; 230 MHz, 32 bits wide]
2.39x faster, but not a fair comparison
Slide33
Maximum Operating Frequency
Multiplier CAD Anomaly! (38 to 54 bits width)
Enough pipeline stages bury the inefficiency
Slide34
Area Density
Slide35
Area Density
[Figure annotations: 72 bits, 1024 words; 72 bits, 4096 words; “Sweet spot”; 67% used (typical); 26% used]
Slide36
Designing Octavo: Lessons & Future Work
Slide37
Lessons
Soft-processors can hit BRAM Fmax
  Octavo: 8 threads, 10 stages, 550 MHz
Self-loop characterization for modules
  Helps reason about their pipelining
  Shows true operating envelopes on FPGA
Octavo spans a large design space
  Significant range of widths, depths, stages
Consider FPGA-centric architecture!
Slide38
Future Work