Octavo: An FPGA-Centric Processor Architecture



Presentation Transcript

Slide1

Octavo: An FPGA-Centric Processor Architecture

Charles Eric LaForest

J. Gregory Steffan

ECE, University of Toronto

FPGA 2012, February 24

Slide2

Easier FPGA Programming

We focus on overlay architectures

Nios, MicroBlaze, Vector Processors

These inherited their architectures from ASICs

Easy to use with existing software tools

Performance penalty

ASIC architectures poor fit to FPGA hardware!

ASIC ≠ FPGA

ASIC: transistors, poly, vias, metal layers

FPGA: LUTs, BRAMs, DSP Blocks, routing

Fixed widths, depths, other discretizations

FPGA-centric processor design?

Slide3

How do FPGAs Want to Compute?

Hardware (Stratix IV)   Width (bits)   Fmax (MHz)
DSP Blocks              36             480
Block RAMs              36             550
ALUTs                   1              800
Nios II/f               32             230

What processor architecture best fits the underlying FPGA?

Slide4

Research Goals

Assume threaded data parallelism

Run at maximum FPGA frequency

Have high performance

Never stall

Aim for simple, minimal ISA

Match architecture to underlying FPGA

Slide5

Result: Octavo

10 stages, 8 threads, 550 MHz

Family of designs

Word width (8 to 72 bits)

Memory depth (2 to 32k words)

Pipeline depth (8 to 16 stages)

Snapshot of work-in-progress

Slide6

Designing Octavo

Slide7

High-Level View of Octavo

Unified registers and RAM

Slide8

Octavo vs. Classic RISC

All memories unified (no loads/stores)

How to pipeline Octavo?

Slide9

Design For Speed: Self-Loop Characterization

Slide10

Self-Loop Characterization

Connect module outputs to inputs

Accounts for the FPGA interconnect

Pipeline loop paths to absorb delays

Pointed to other limits than raw delay

Minimum clock pulse widths

DSP Blocks: 480 MHz

BRAMs: 550 MHz

We measured some surprising delays…

Slide11

BRAM Self-Loop Characterization

398 MHz (routing!)

656 MHz

531 MHz
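As a rough illustration of how the self-loop measurements on this slide can be read (a sketch, not the authors' tooling): the usable clock of a BRAM path is the lower of its measured self-loop speed and the 550 MHz BRAM minimum-clock-pulse-width limit quoted on the previous slide. The helper name `usable_fmax` is hypothetical. The list simply follows the order in which the four speeds appear on the slide, because the transcript does not preserve which BRAM configuration produced which number.

```python
BRAM_PULSE_WIDTH_LIMIT_MHZ = 550  # BRAM limit quoted in the slides

# Measured BRAM self-loop speeds from this slide, in MHz (slide order).
# Which loop configuration gave which value is not recoverable here.
measured_mhz = [398, 656, 531, 710]


def usable_fmax(loop_mhz, hard_limit_mhz=BRAM_PULSE_WIDTH_LIMIT_MHZ):
    """Hypothetical helper: a path cannot run faster than the hard limit."""
    return min(loop_mhz, hard_limit_mhz)


usable = [usable_fmax(f) for f in measured_mhz]
hits_limit = [f for f in measured_mhz if f >= BRAM_PULSE_WIDTH_LIMIT_MHZ]

print(usable)      # [398, 550, 531, 550]
print(hits_limit)  # [656, 710]
```

Under this reading, only the loops measured above 550 MHz leave the BRAM itself as the bottleneck. The slower ones are limited by the path around it.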

710 MHz

Must connect BRAMs using registers

Slide12

Building Octavo: Memory

Slide13

Building Octavo: Memory

Slide14

Memory

Replicated “scratchpad” memories with I/O, while still exceeding 550 MHz limit.

[Figure labels: Instruction, ALU Result]

Slide15

Building Octavo: ALU

Slide16

Building Octavo: ALU

Fully pipelined (4 stages)

Never stalls

Slide17

Building Octavo: ALU

Multiplication

Uses DSP Blocks

Must overcome their 480 MHz limit…

Slide18

Building Octavo: Multiplier

One multiplier is wide enough but too slow

Two multipliers working at half-speed

Send data to both multipliers in alternation
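The alternation scheme can be sketched in a few lines of Python. This is a behavioral model under stated assumptions, not the Octavo hardware: the class `HalfRateMultiplier`, the function `run`, and the latency of 4 cycles are all hypothetical. Each modeled multiplier refuses new operands more often than every second cycle, and the driver sends operands to the two multipliers on alternating cycles.

```python
class HalfRateMultiplier:
    """Hypothetical model: accepts one operand pair every 2 cycles."""

    def __init__(self, latency=4):  # latency is an assumed value
        self.latency = latency
        self.last_issue = None
        self.in_flight = []  # (ready_cycle, product), oldest first

    def issue(self, cycle, a, b):
        # Enforce the half-speed constraint.
        assert self.last_issue is None or cycle - self.last_issue >= 2
        self.last_issue = cycle
        self.in_flight.append((cycle + self.latency, a * b))

    def collect(self, cycle):
        if self.in_flight and self.in_flight[0][0] == cycle:
            return self.in_flight.pop(0)[1]
        return None


def run(pairs, latency=4):
    mults = [HalfRateMultiplier(latency), HalfRateMultiplier(latency)]
    results = []
    for cycle in range(len(pairs) + latency):
        # At most one multiplier finishes in any given cycle.
        for m in mults:
            out = m.collect(cycle)
            if out is not None:
                results.append(out)
        if cycle < len(pairs):
            a, b = pairs[cycle]
            mults[cycle % 2].issue(cycle, a, b)  # alternate each cycle
    return results


pairs = [(i, i + 1) for i in range(8)]
products = run(pairs)
print(products == [a * b for a, b in pairs])  # True
```

The pair together accepts one operand pair per cycle and returns products in issue order, while each multiplier only sees new data every other cycle.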

[Figure labels: 480 MHz, 600 MHz]

Slide19

Octavo: Putting It All Together

Slide20

Octavo

[Figure: pipeline stages 0 to 9]

Pipeline

10 stages

Actually 8 stages with one exception (more later)

No result forwarding or pipeline interlocks

Scalar, Single-Issue, In-Order, Multi-Threaded

Slide21

Octavo

Instruction Memory

Indexed by current thread PC

Provides a 3-operand instruction

On-chip BRAMs only

[Figure: pipeline stages 0 to 9, with the instruction memory (I) marked]

Slide22

Octavo

A and B Memories

Receive operand addresses from instruction

Provide data operands to ALU and Controller

Some addresses map to I/O ports

On-chip BRAMs only

[Figure: pipeline stages 0 to 9, with the I and A/B memories marked]

Slide23

Octavo

Pipeline Registers

Avoid an odd number of stages

Separate BRAMs for best speed

Predicted by BRAM self-loop characterization

Unusual but essential design constraint

[Figure: pipeline stages 0 to 9, with the I and A/B memories marked]

Slide24

Octavo

Controller

Receives opcode, source/destination operands

Decides branches

Provides current PC of next thread to I memory

[Figure: pipeline stages 0 to 9, with CTL0, CTL1, I, and A/B marked]

Slide25

Octavo

ALU

Receives opcode and data

Writes result to all memories
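The data flow described on the last few slides (the instruction supplies operand addresses, A and B supply data, and the ALU result goes to all memories) can be sketched as a toy model. Everything concrete here is an assumption made for illustration: the class `ToyOctavo`, the tuple layout `(opcode, dest, src_a, src_b)`, the opcode names, and the memory depth. The model also treats I, A, and B as replicas sharing one address space, and it ignores pipelining, threads, and I/O-mapped addresses.

```python
import operator

DEPTH = 16  # assumed memory depth for the toy model
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


class ToyOctavo:
    def __init__(self):
        self.i_mem = [None] * DEPTH  # instruction memory
        self.a_mem = [0] * DEPTH     # operand memory A
        self.b_mem = [0] * DEPTH     # operand memory B

    def write_all(self, addr, value):
        # The ALU result goes to every memory at the same address.
        self.i_mem[addr] = value
        self.a_mem[addr] = value
        self.b_mem[addr] = value

    def step(self, pc):
        # Hypothetical 3-operand instruction layout.
        opcode, dest, src_a, src_b = self.i_mem[pc]
        result = OPS[opcode](self.a_mem[src_a], self.b_mem[src_b])
        self.write_all(dest, result)
        return result


cpu = ToyOctavo()
cpu.write_all(8, 6)                # data word at address 8
cpu.write_all(9, 7)                # data word at address 9
cpu.i_mem[0] = ("MUL", 10, 8, 9)   # program word at address 0
cpu.step(0)
print(cpu.a_mem[10], cpu.b_mem[10], cpu.i_mem[10])  # 42 42 42
```

Because every result lands in all three memories, no separate load or store instruction is needed to move a value between "registers" and "RAM".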

[Figure: pipeline stages 0 to 9, with ALU0 to ALU3, CTL0, CTL1, I, and A/B marked]

Slide26

Octavo

[Figure: pipeline stages 0 to 9, with ALU0 to ALU3, CTL0, CTL1, I, and A/B marked]

Longest mandatory loop: 8 stages

Along A/B memories and ALU

Fill with 8 threads to avoid stalls
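A small sketch of why 8 threads fill the 8-stage loop, assuming strict round-robin issue (the function names are hypothetical). Each thread issues once every `num_threads` cycles, and its previous result needs `loop_stages` cycles to come back around. With both equal to 8, the result is back exactly when the thread issues again. The 4-thread case is hypothetical and only shows the gap that would otherwise open, since the slides state that Octavo has no interlocks.

```python
LOOP_STAGES = 8  # longest mandatory loop (A/B memories and ALU)


def issue_schedule(num_threads, num_cycles):
    """Round-robin: thread t issues on cycles t, t + num_threads, ..."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def stall_cycles_per_instruction(num_threads, loop_stages=LOOP_STAGES):
    """Cycles a thread would wait for its own previous result.

    A thread's consecutive instructions are num_threads cycles apart;
    the previous result needs loop_stages cycles to come back around.
    """
    return max(0, loop_stages - num_threads)


print(issue_schedule(8, 10))            # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
print(stall_cycles_per_instruction(8))  # 0
print(stall_cycles_per_instruction(4))  # 4
```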

[Figure: threads T0 to T7 occupying the pipeline stages]

Slide27

Octavo

Special case longest loop: 10 stages

Along instruction memory and ALU

Does not affect most computations

Adds a delay slot to subroutine and loop code
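One way to read the delay slot, offered as an assumption rather than the authors' derivation: with 8 threads issuing round-robin, a thread gets one issue slot every 8 cycles, but a result travelling the 10-stage loop is not yet back for that thread's very next instruction. The hypothetical helper below counts how many of a thread's issue slots pass before such a result is visible.

```python
import math


def delay_slots(loop_stages, num_threads):
    """Hypothetical helper: issue slots a thread waits before a result
    travelling a loop of loop_stages stages becomes visible to it."""
    # A thread gets one issue slot every num_threads cycles.
    return max(0, math.ceil(loop_stages / num_threads) - 1)


print(delay_slots(8, 8))   # 0
print(delay_slots(10, 8))  # 1
```

The 8-stage loop needs no delay slot, while the 10-stage loop needs one, which matches the single delay slot mentioned on the slide.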

[Figure: pipeline stages 0 to 9, with ALU0 to ALU3, CTL0, CTL1, I, and A/B marked]

Slide28

Results: Speed and Area

Slide29

Experimental Framework

Quartus 10.1 targeting Stratix IV (fastest)

Optimize and place for speed

Average speed over 10 placement runs

Varied processor parameters:

Word width

Memory depth

Pipeline depth

Measure Frequency, Area, and Density

Slide30

Maximum Operating Frequency

Slide31

Maximum Operating Frequency

[Figure: maximum operating frequency plot; labels: Faster, Wider; annotations: BRAM hard limit, Timing Slack!]

Slide32

Maximum Operating Frequency

[Figure annotations: 550+ MHz, 36 bits wide; 230 MHz, 32 bits wide]

2.39x faster, but not a fair comparison

Slide33

Maximum Operating Frequency

Multiplier CAD Anomaly! (38 to 54 bits width)

Enough pipeline stages bury the inefficiency

Slide34

Area Density

Slide35

Area Density

[Figure annotations: 72 bits, 1024 words; 72 bits, 4096 words; “Sweet spot”; 67% used (typical); 26% used]

Slide36

Designing Octavo: Lessons & Future Work

Slide37

Lessons

Soft-processors can hit BRAM Fmax

Octavo: 8 threads, 10 stages, 550 MHz

Self-loop characterization for modules

Helps reason about their pipelining

Shows true operating envelopes on FPGA

Octavo spans a large design space

Significant range of widths, depths, stages

Consider FPGA-centric architecture!

Slide38

Future Work
