
ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture - PowerPoint Presentation

Uploaded On 2018-09-22




Presentation Transcript

Slide1

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture

Jason Cong, Yu-Ting Chen, Zhenman Fang, Bingjun Xiao, Peipei Zhou
Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research

Slide2

Our Motivation and Goal

A stack of research tools for accelerator-rich architecture (early-stage to late-stage):
  Standalone accelerator simulation: Aladdin
  Standalone accelerator generation: HLS
  System-level HLS-based ARA simulation: PARADE
  System-level pre-RTL SoC simulation: gem5 + Aladdin
  ARA FPGA prototyping: ARAPrototyper
Evaluate an ARA on the real prototype
Bridge the gap between simulations and real chips
Reduce the hardware design effort and system integration effort for users

Slide3

Evaluation of an Accelerator-Rich Architecture

Final stage
  FPGA prototyping
  Collect real runtime and system power numbers
  System integration (software + hardware)

Slide4

Prototyping an ARA

Using Xilinx Zynq SoC (FPGA fabrics + ARM)
Major components of an ARA
  General processor cores
  A sea of heterogeneous accelerators
  Memory system + interconnects (NoC)

Slide5

ARAPrototyper: A Rapid Prototyping Flow

ARAPrototyper features
  Highly automated accelerator prototyping flow
  Reusable baseline prototype
  Application programming interfaces (user APIs)
  Efficient system software stack (runtime for accelerators)
Prototype platform
  Xilinx Zynq ZC706
  SoC: dual-core ARM Cortex-A9 + FPGA fabrics
  FPGA: implements accelerators, on-chip memories, and interconnects
  Linux is ported

Slide6

Advantages of ARAPrototyper

Design effort reduction
  Automatically generated hardware IPs, system software, and APIs
  Reusable interconnect and memory system templates
  Push-button flow
Short evaluation cycle
  Can run native binaries
  Can run large input data sets
Users can easily adjust the microarchitecture of an accelerator
  Use the HLS design flow
  See the impact at the system level

Slide7

Architecture Overview

Slide8

System Overview (What Can be Modeled?)

[Figure: system block diagram. Recoverable labels: CMP with cores (C), L1 caches, and LLC; accelerators Acc 1 to Acc m; ARA memory system; memory ports MP 1 to MP N; memory controller; DRAM DIMMs; IOMMU with TLB; memory requests (virtual); Linux; system software stack (reservations, starts, releases); applications App 1 to App k.]

1. General-purpose cores
2. Heterogeneous accelerators
3. On-chip ARA memory system
4. Off-chip memory system
5. Coherency level
6. Virtual -> physical address translation
7. System software (runtime) & APIs

Slide9

ARA Template (What Can be Reused?)

[Figure: ARA template block diagram. Recoverable labels: accelerators Acc 1 to Acc M; DMAC 1 to DMAC N; memory ports MP 1 to MP N; IOMMU with TLB; memory requests (virtual); memory requests (physical).]

User-designed accelerators
  Heterogeneous accelerators
Highly parameterized hardware templates
  Interconnect layer 1: partial crossbar
  Homogeneous shared memory banks
  Interconnect layer 2: interleaved network
  DMACs
  Physical memory ports
  IOMMU and dedicated TLB

Slide10

Hardware Prototyping Flow

Slide11

ARA Design Flow

Accelerator design
  Integrated with high-level synthesis (Xilinx Vivado HLS)
ARA system design: ARA specification file
  Specify accelerators, on-chip memories, and interconnects
  Highly parameterized
  HW templates of on-chip memories and interconnects can be reused
Push-button flow
  Uses the Vivado HLS and Xilinx PlanAhead flows for FPGA bitstream synthesis
  Automatically generates the APIs based on the ARA specification file

Slide12

ARA Specification File

<system>
  <ACCs>
    <acc type="gradient" num="2" num_params="5">
      <port size="16384" num="6"/>
    </acc>
    <acc type="segmentation" num="1" num_params="13">
      <port size="16384" num="8"/>
    </acc>
    <acc type="rician" num="1" num_params="7">
      <port size="16384" num="12"/>
    </acc>
    <acc type="gaussian" num="1" num_params="7">
      <port size="16384" num="5"/>
    </acc>
  </ACCs>
  <SharedBuffers size="16384" num="32" numDMACs="4"/>
  <Interconnects>
    <ACCs_to_Buffers type="crossbar" ON="4"/>
    <Buffers_to_DMACs type="interleaved"/>
  </Interconnects>
  <IOMMU>
    <TLB size="8192" evict="LRU"/>
  </IOMMU>
  <CoherentCache use="0"/>
  <AccFrequency hz="75000000"/>
</system>

Annotations on the file:
  <ACCs>: accelerator kernels
  <SharedBuffers>: shared memory banks & DMACs
  <Interconnects>: interconnect configurations
  <IOMMU>: IOMMU/TLB
  <CoherentCache>: coherent L2 cache
  <AccFrequency>: accelerator frequency

[Figure: the system described by the file. Recoverable labels: accelerators grad 1, grad 2, seg 1, rician 1, gauss 1; numbered shared memory banks; DMAC 1 to DMAC 4; IOMMU with TLB; memory ports MP 1 to MP 4.]

Slide13
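As an aside to the specification file on Slide12: the sketch below copies the file's values by hand into plain C++ structs and derives a few totals from them. The struct and function names are hypothetical and are not part of ARAPrototyper, and the code assumes that the num attribute of <port> counts ports per accelerator instance, which the slide does not state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mirror of one <acc> entry of the specification file.
struct AccSpec {
    std::string type;  // "type" attribute, e.g. "gradient"
    int num;           // "num" attribute of <acc>: instances of this kernel
    int num_params;    // "num_params" attribute
    int port_size;     // "size" attribute of <port>
    int num_ports;     // "num" attribute of <port>, assumed to be per instance
};

// Values copied by hand from the specification file on Slide12.
std::vector<AccSpec> example_spec() {
    return {
        {"gradient", 2, 5, 16384, 6},
        {"segmentation", 1, 13, 16384, 8},
        {"rician", 1, 7, 16384, 12},
        {"gaussian", 1, 7, 16384, 5},
    };
}

// Total number of accelerator instances across all kernel types.
int total_instances(const std::vector<AccSpec>& accs) {
    int total = 0;
    for (const AccSpec& a : accs) total += a.num;
    return total;
}

// Total number of accelerator-side memory ports, assuming every
// instance gets its own set of ports.
int total_ports(const std::vector<AccSpec>& accs) {
    int total = 0;
    for (const AccSpec& a : accs) total += a.num * a.num_ports;
    return total;
}

// Clock period in nanoseconds for a frequency given in Hz.
double clock_period_ns(long hz) {
    assert(hz > 0);
    return 1.0e9 / static_cast<double>(hz);
}
```

Under these assumptions the Slide12 values give 5 accelerator instances, 37 accelerator-side ports, and a clock period of about 13.3 ns at 75 MHz.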

Accelerator Design with HLS

void gradient(
    // Parameters sent from the application (passed through the AXI-Lite interface)
    volatile unsigned int* param1,
    …
    volatile unsigned int* param5,
    // Connections to the homogeneous memory banks
    volatile float mem_port1[SIZE],
    …
    volatile float mem_port6[SIZE],
    // Interfaces to the IOMMU
    volatile unsigned int* toIOMMU_FIFO,
    volatile unsigned int* fromIOMMU_FIFO)
{
    // Read input parameters
    int image_vddr = *param1;
    …
    // Main computation
    for (int i = 1; i < P; i++)
        for (int j = 1; j < M; j++)
            for (int k = 1; k < N; k++) {
                g[CENTER] = 1.0 / sqrt(EPSILON
                    + SQR(u[CENTER] - u[RIGHT])
                    + SQR(u[CENTER] - u[LEFT])
                    + SQR(u[CENTER] - u[DOWN])
                    + SQR(u[CENTER] - u[UP])
                    + SQR(u[CENTER] - u[ZOUT])
                    + SQR(u[CENTER] - u[ZIN]));
            }
}

Perform optimization on the original C code:
  Pipelining
  Unrolling
  (Manually) partition memory into multiple memory banks
  Goal: initiation interval (II) = 1
Minimize off-chip bandwidth (data reuse optimization)

Slide14
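As an aside to the HLS kernel on Slide13: a software-only reference sketch of the same stencil can be handy for checking an accelerator's output on small inputs. The slide does not define CENTER, RIGHT, LEFT, and the other index macros, nor EPSILON, so the flat row-major layout, the neighbor offsets, the EPSILON value, and the function name gradient_ref are all assumptions. The loops also stop one voxel before the upper boundary so that every neighbor access stays in range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed smoothing constant; the slide does not give its value.
constexpr float EPSILON = 1.0e-6f;

inline float sqr(float x) { return x * x; }

// For every interior voxel of a P x M x N volume (row-major, k fastest),
// computes g = 1 / sqrt(EPSILON + sum of squared differences between the
// voxel and its six axis neighbors). Boundary voxels of g are untouched.
void gradient_ref(const std::vector<float>& u, std::vector<float>& g,
                  std::size_t P, std::size_t M, std::size_t N) {
    assert(u.size() == P * M * N);
    assert(g.size() == u.size());
    const std::size_t plane = M * N;  // stride between consecutive i slices
    for (std::size_t i = 1; i + 1 < P; ++i)
        for (std::size_t j = 1; j + 1 < M; ++j)
            for (std::size_t k = 1; k + 1 < N; ++k) {
                const std::size_t c = i * plane + j * N + k;  // CENTER
                const float sum =
                    sqr(u[c] - u[c + 1]) + sqr(u[c] - u[c - 1]) +          // k neighbors
                    sqr(u[c] - u[c + N]) + sqr(u[c] - u[c - N]) +          // j neighbors
                    sqr(u[c] - u[c + plane]) + sqr(u[c] - u[c - plane]);   // i neighbors
                g[c] = 1.0f / std::sqrt(EPSILON + sum);
            }
}
```

On a constant volume all six differences vanish, so every interior voxel evaluates to 1/sqrt(EPSILON), which makes a convenient sanity check before comparing against hardware results.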

Hardware Design Automation Flow

[Figure: ARAPrototyper design automation flow. Recoverable labels, grouped by stage:]
User-designed accelerators (HLS): user-designed ACCs in C -> high-level synthesis -> ACC designs in RTL
ARA memory system and interconnects (highly parameterized): ARA specification file, module configurations, hardware templates, ARA memory system optimization, module instantiation
Platform-independent modules (interconnects and memory system)
Platform-specific modules, selected by platform name
System integration
Xilinx PlanAhead flow: platform backend flow (logic synthesis, mapping, placement and route) -> FPGA bitstream

Slide15

APIs and System Software Stack

Slide16

User APIs (How to Program Accelerators)

Include the header file of the accelerator definitions
  The header is generated from the ARA description file (XML format)
User APIs: reserve(), check_reserved(), send_param(), check_done(), free()
  They communicate with the software global accelerator manager (GAM)

Code annotations:
Header file: declares a class for each type of accelerator
Declaration: use the acc object to manipulate the Gaussian accelerator in the ARA
1. reserve(): reserve the Gaussian accelerator (send a signal to GAM)
2. check_reserved(): wait until GAM confirms the reservation
3. send_param(): send parameters and start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array
4. check_done(): check the done signal from GAM
5. free(): free the Gaussian accelerator so that GAM can use it for other applications

Slide17

User APIs (Cont'd)

What's behind run():

acc.run(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1);

expands to:

acc.reserve();                        // 1. reserve the Gaussian accelerator (send a signal to GAM)
while (acc.check_reserved() == 0);    // 2. wait until GAM confirms the reservation
acc.send_param(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1);
                                      // 3. send parameters; start the Gaussian accelerator
while (acc.check_done() == 0);        // 4. check the done signal from GAM
acc.free();                           // 5. free the accelerator so GAM can use it for other applications

Slide18
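As an aside to the five-step sequence on Slide17: the slides do not show the generated header, so the sketch below only mimics the call order with a mock. MockGam and MockGaussianAcc are hypothetical names, the mock accepts integer parameters only, and it completes instantly instead of driving hardware. Unlike the slide's busy-wait, the mock's run() returns false when the reservation is refused, because nothing in a single-threaded mock could ever release the accelerator while it spins.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the global accelerator manager (GAM):
// it only tracks whether the single Gaussian accelerator is reserved.
class MockGam {
public:
    bool try_reserve() {
        if (busy_) return false;
        busy_ = true;
        return true;
    }
    void release() { busy_ = false; }
    bool busy() const { return busy_; }
private:
    bool busy_ = false;
};

// Hypothetical accelerator handle exposing the five user APIs by name.
class MockGaussianAcc {
public:
    explicit MockGaussianAcc(MockGam& gam) : gam_(gam) {}

    void reserve() { reserved_ = gam_.try_reserve(); }        // step 1
    int check_reserved() const { return reserved_ ? 1 : 0; }  // step 2
    void send_param(std::vector<int> params) {                // step 3
        assert(reserved_);
        params_ = std::move(params);
        done_ = true;  // the mock "finishes" immediately
    }
    int check_done() const { return done_ ? 1 : 0; }          // step 4
    void free() {                                             // step 5
        if (reserved_) gam_.release();
        reserved_ = false;
        done_ = false;
    }

    // Convenience wrapper that performs steps 1 to 5 in order.
    bool run(std::vector<int> params) {
        reserve();
        if (check_reserved() == 0) return false;  // mock: give up, do not spin
        send_param(std::move(params));
        while (check_done() == 0) {}
        free();
        return true;
    }

    const std::vector<int>& last_params() const { return params_; }

private:
    MockGam& gam_;
    bool reserved_ = false;
    bool done_ = false;
    std::vector<int> params_;
};
```

A mock like this keeps application code compilable and testable on a development machine before the prototype board is available.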

Underlying System Software Stack

Major components:
  GAM: global accelerator manager
  DBA: dynamic buffer allocator
  TLB miss handler
  Coherence manager

Slide19

Flow Efficiency & Case Study

Slide20

ARA Evaluation Runtime

One flow can be finished within four hours
  (1) ARAPrototyper flow runtime, (2) Xilinx tools (> 98%): Vivado HLS and PlanAhead (Map, P&R), and (3) native execution on the prototype
2.9x ~ 42.6x evaluation time reduction compared to full-system simulations
Native execution on prototype vs. simulation: 10^5 ~ 10^6 difference

Slide21

Case Study: A Real ARA Prototype

Medical Imaging Applications
Heterogeneous Accelerators
  Acc0: gradient
  Acc1: gaussian
  Acc2: rician
  Acc3: segmentation

Slide22

ARA Prototype vs. Xeon vs. ARM

Measured from the real prototype and real machines
7.44x better energy efficiency over Intel Sandy-Bridge processor
2.22x better energy efficiency over ARM Cortex-A9
24x - 84x energy efficiency for ASIC projection

                    Cortex-A9   Xeon (24 threads, OpenMP)   ARA
Frequency           667 MHz     1.9 GHz                     ACCs @ 100 MHz, CPU @ 667 MHz
Runtime (seconds)   28.34       0.55                        4.53
Power               1.1 W       190 W (TDP)                 3.1 W
Total energy        2.22x       7.44x                       1x

Slide23
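As an aside to the comparison on Slide22: the total-energy row follows from runtime multiplied by power, normalized to the ARA prototype. The short sketch below reproduces that arithmetic from the runtime and power figures on the slide; the struct and function names are hypothetical, and the Xeon figure is a TDP rather than a measured power, as the slide itself notes.

```cpp
#include <cassert>

// One column of the Slide22 table.
struct Platform {
    const char* name;
    double runtime_s;  // runtime in seconds
    double power_w;    // power in watts (TDP for the Xeon)
};

// Energy in joules: runtime times power.
double energy_j(const Platform& p) { return p.runtime_s * p.power_w; }

// Energy of p relative to a baseline platform.
double energy_ratio(const Platform& p, const Platform& baseline) {
    assert(energy_j(baseline) > 0.0);
    return energy_j(p) / energy_j(baseline);
}

// Figures copied from the Slide22 table.
const Platform kCortexA9{"Cortex-A9", 28.34, 1.1};
const Platform kXeon{"Xeon (24 threads, OpenMP)", 0.55, 190.0};
const Platform kAra{"ARA", 4.53, 3.1};
```

With these inputs the energies are about 31.2 J (Cortex-A9), 104.5 J (Xeon), and 14.0 J (ARA), which gives the 2.22x and 7.44x ratios quoted on the slide.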

Use Case 1: Accelerator Microarchitecture Modification

Data reuse optimization
  Goal: II = 1 and reduce the required off-chip bandwidth by storing data in the local memory banks
  2.5x - 5.9x performance gain

Slide24

Use Case 2: Interconnect Resource Utilization

Use the FPGA prototype to project resource utilization
Two partial crossbar synthesis methods (number of accelerators: 30)
  [Figure: two layouts side by side. The left-hand one is much more congested and consumes more resources.]

Slide25

Summary

An alternative for ARA evaluation
  Collect the performance and energy data from real silicon
Goals of the ARAPrototyper
  Reduce the design effort
  Shorten the evaluation time
ARAPrototyper features:
  Highly automated ARA prototyping flow
  Reusable interconnect and memory system templates
  Automatically generated APIs
  System software stack (runtime)

Slide26

Thank You!

Zhenman is on the academic job market; please check his website: https://sites.google.com/site/fangzhenman/