Slide1
ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture
Jason Cong, Yu-Ting Chen, Zhenman Fang, Bingjun Xiao, Peipei Zhou
Computer Science Department, UCLA
Center for Domain-Specific Computing
Center for Future Architectures Research
Slide2
Our Motivation and Goal
A stack of research tools for accelerator-rich architecture (early-stage to late-stage):
- Standalone accelerator simulation: Aladdin
- Standalone accelerator generation: HLS
- System-level HLS-based ARA simulation: PARADE
- System-level pre-RTL SoC simulation: gem5 + Aladdin
- ARA FPGA prototyping: ARAPrototyper
Evaluate an ARA on the real prototype:
- Bridge the gap between simulations and real chips
- Reduce the hardware design and system integration efforts for the users
Slide3
Evaluation of an Accelerator-Rich Architecture
Final stage: FPGA prototyping
- Collect real runtime and system power numbers
- System integration (software + hardware)
Slide4
Prototyping an ARA
Using Xilinx Zynq SoC (FPGA fabrics + ARM)
Major components of an ARA:
- General processor cores
- A sea of heterogeneous accelerators
- Memory system + interconnects (NoC)
Slide5
ARAPrototyper: A Rapid Prototyping Flow
ARAPrototyper features:
- Highly automated accelerator prototyping flow
- Reusable baseline prototype
- Application programming interfaces (user APIs)
- Efficient system software stack (runtime for accelerators)
Prototype platform: Xilinx Zynq ZC706
- SoC: dual-core Cortex-A9 ARM + FPGA fabrics
- FPGA: implements accelerators, on-chip memories, and interconnects
- Linux is ported
Slide6
Advantages of ARAPrototyper
Design effort reduction:
- Automatically generated hardware IPs, system software, and APIs
- Reusable interconnect and memory system templates
- Push-button flow
Short evaluation cycle:
- Can run native binaries
- Can run large input data sets
User can adjust the microarchitecture of an accelerator easily:
- Use the HLS design flow
- See the impact at the system level
Slide7
Architecture Overview
Slide8
System Overview (What Can be Modeled?)
[Figure: system block diagram. Labels: CMP (cores C, L1 caches, LLC); ARA memory system; memory ports MP 1 to MP N; memory controller; DRAM DIMMs; IOMMU with TLB; memory requests (virtual); accelerators Acc 1 to Acc m; Linux; system software stack (reservations, starts, releases); applications App 1 to App k.]
1. General-purpose cores
2. Heterogeneous accelerators
3. On-chip ARA memory system
4. Off-chip memory system
5. Coherency level
6. Virtual -> physical address translation
7. System software (runtime) & APIs
Slide9
ARA Template (What Can be Reused?)
[Figure: ARA hardware template. Labels:
- Heterogeneous accelerators: Acc 1 to Acc M (user-designed accelerators)
- Interconnect layer 1: partial crossbar
- Homogeneous shared memory banks
- DMACs: DMAC 1 to DMAC N
- Interconnect layer 2: interleaved network
- Physical memory ports: MP 1 to MP N
- IOMMU and dedicated TLB; memory requests (virtual) in, memory requests (physical) out
- Legend: user-designed accelerators; highly parameterized hardware templates]
Slide10
Hardware Prototyping Flow
Slide11
ARA Design Flow
Accelerator design:
- Integrated with high-level synthesis (Xilinx Vivado HLS)
ARA system design: ARA specification file
- Specify accelerators, on-chip memories, and interconnects
- Highly parameterized
- HW templates of on-chip memories and interconnects can be reused
Push-button flow:
- Uses the Vivado HLS and Xilinx PlanAhead flow for FPGA bitstream synthesis
- Automatically generates the APIs based on the ARA spec. file
Slide12
ARA Specification File
<system>
  <ACCs>
    <acc type="gradient" num="2" num_params="5">
      <port size="16384" num="6"/>
    </acc>
    <acc type="segmentation" num="1" num_params="13">
      <port size="16384" num="8"/>
    </acc>
    <acc type="rician" num="1" num_params="7">
      <port size="16384" num="12"/>
    </acc>
    <acc type="gaussian" num="1" num_params="7">
      <port size="16384" num="5"/>
    </acc>
  </ACCs>
  <SharedBuffers size="16384" num="32" numDMACs="4"/>
  <Interconnects>
    <ACCs_to_Buffers type="crossbar" ON="4"/>
    <Buffers_to_DMACs type="interleaved"/>
  </Interconnects>
  <IOMMU>
    <TLB size="8192" evict="LRU"/>
  </IOMMU>
  <CoherentCache use="0"/>
  <AccFrequency hz="75000000"/>
</system>
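To make the accelerator section of the specification concrete, here is a minimal C++ sketch that holds the four `<acc>` entries in memory and derives two totals from them. The `AccSpec` struct and the helper functions are hypothetical names introduced for illustration; they are not part of ARAPrototyper. The sketch also assumes that the `num` attribute of `<port>` counts ports per accelerator instance, which the slide does not state.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of one <acc> entry (illustration only).
struct AccSpec {
    std::string type;        // type attribute, e.g. "gradient"
    std::size_t instances;   // num attribute of <acc>
    std::size_t num_params;  // num_params attribute of <acc>
    std::size_t port_size;   // size attribute of <port>
    std::size_t ports;       // num attribute of <port> (assumed per instance)
};

// Values copied from the example specification file.
inline std::vector<AccSpec> example_accs() {
    return {
        {"gradient", 2, 5, 16384, 6},
        {"segmentation", 1, 13, 16384, 8},
        {"rician", 1, 7, 16384, 12},
        {"gaussian", 1, 7, 16384, 5},
    };
}

// Total number of accelerator instances described by the file.
inline std::size_t total_instances(const std::vector<AccSpec>& accs) {
    std::size_t n = 0;
    for (const AccSpec& a : accs) n += a.instances;
    return n;
}

// Total number of accelerator-side ports, counting every instance.
inline std::size_t total_ports(const std::vector<AccSpec>& accs) {
    std::size_t n = 0;
    for (const AccSpec& a : accs) n += a.instances * a.ports;
    return n;
}
```

Under these assumptions the example file describes 5 accelerator instances with 37 accelerator-side ports in total, to be served by the 32 shared buffers declared in `<SharedBuffers>`.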
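Annotations on the specification file:
- <ACCs>: accelerator kernels
- <SharedBuffers>: shared memory banks & DMACs
- <Interconnects>: interconnect configurations
- <IOMMU>: IOMMU/TLB
- <CoherentCache>: coherent L2 cache
- <AccFrequency>: accelerator frequency
[Figure: the resulting ARA. Labels: accelerators grad 1, grad 2, seg 1, rician 1, gauss 1; numbered shared memory banks; DMAC 1 to DMAC 4; memory ports MP 1 to MP 4; IOMMU with TLB.]
Slide13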
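The `<TLB size="8192" evict="LRU"/>` entry of the ARA specification file asks for a TLB with least-recently-used eviction. As a rough software illustration of that policy only, the sketch below models an LRU-evicting translation cache in C++. The class `LruTlb` and its methods are hypothetical, the capacity is counted in entries (the slide does not give the unit of `size`), and nothing here describes how the hardware TLB is actually built.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical software model of a TLB with LRU eviction (illustration only).
class LruTlb {
public:
    explicit LruTlb(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a hit and stores the physical page in ppage.
    // A hit also marks the entry as most recently used.
    bool lookup(std::uint64_t vpage, std::uint64_t& ppage) {
        auto it = index_.find(vpage);
        if (it == index_.end()) return false;
        entries_.splice(entries_.begin(), entries_, it->second);
        ppage = it->second->second;
        return true;
    }

    // Installs a translation, evicting the least recently used entry if full.
    void insert(std::uint64_t vpage, std::uint64_t ppage) {
        auto it = index_.find(vpage);
        if (it != index_.end()) {
            it->second->second = ppage;
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (entries_.size() == capacity_) {
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(vpage, ppage);
        index_[vpage] = entries_.begin();
    }

private:
    using Entry = std::pair<std::uint64_t, std::uint64_t>;  // (virtual, physical)
    std::size_t capacity_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
};
```

A failed `lookup` corresponds to the case that the TLB miss handler in the system software stack has to resolve before `insert` installs the new translation.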
Accelerator Design with HLS
void gradient (
    volatile unsigned int* param1,
    …
    volatile unsigned int* param5,
    volatile float mem_port1[SIZE],
    …
    volatile float mem_port6[SIZE],
    volatile unsigned int* toIOMMU_FIFO,
    volatile unsigned int* fromIOMMU_FIFO) {
  // read function parameters
  int image_vddr = *param1;
  …
  // main computation
  for (int i = 1; i < P; i++)
    for (int j = 1; j < M; j++)
      for (int k = 1; k < N; k++) {
        g[CENTER] = 1.0/sqrt(
            EPSILON
            + SQR(u[CENTER] - u[RIGHT])
            + SQR(u[CENTER] - u[LEFT])
            + SQR(u[CENTER] - u[DOWN])
            + SQR(u[CENTER] - u[UP])
            + SQR(u[CENTER] - u[ZOUT])
            + SQR(u[CENTER] - u[ZIN])
        );
      }
}
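The gradient kernel above leaves its index macros (CENTER, RIGHT, LEFT, and so on) and its constants undefined. As a hedged aid for checking accelerator output against software, the following C++ sketch computes the same stencil on the CPU. The row-major layout in `idx`, the mapping of the six neighbours to the three axes, the restriction to interior voxels, and the caller-supplied `epsilon` are all assumptions made for this example and are not taken from the slide.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed row-major layout for a P x M x N volume (not given on the slide).
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t k,
                       std::size_t M, std::size_t N) {
    return (i * M + j) * N + k;
}

inline float sqr(float x) { return x * x; }

// CPU reference: g = 1 / sqrt(epsilon + sum of squared differences between a
// voxel and its six axis neighbours). Only interior voxels are written, so
// border entries of g stay 0.
inline std::vector<float> gradient_reference(const std::vector<float>& u,
                                             std::size_t P, std::size_t M,
                                             std::size_t N, float epsilon) {
    assert(u.size() == P * M * N);
    std::vector<float> g(u.size(), 0.0f);
    for (std::size_t i = 1; i + 1 < P; ++i)
        for (std::size_t j = 1; j + 1 < M; ++j)
            for (std::size_t k = 1; k + 1 < N; ++k) {
                const float c = u[idx(i, j, k, M, N)];
                const float sum = epsilon
                    + sqr(c - u[idx(i, j, k + 1, M, N)])
                    + sqr(c - u[idx(i, j, k - 1, M, N)])
                    + sqr(c - u[idx(i, j + 1, k, M, N)])
                    + sqr(c - u[idx(i, j - 1, k, M, N)])
                    + sqr(c - u[idx(i + 1, j, k, M, N)])
                    + sqr(c - u[idx(i - 1, j, k, M, N)]);
                g[idx(i, j, k, M, N)] = 1.0f / std::sqrt(sum);
            }
    return g;
}
```

On a uniform input every difference is zero, so each interior voxel evaluates to 1/sqrt(epsilon), which gives a quick sanity check.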
Annotations on the code:
- param1 … param5: parameters sent from the application (passed through the AXI-Lite interface)
- mem_port1 … mem_port6: connections to the homogeneous memory banks
- toIOMMU_FIFO, fromIOMMU_FIFO: interfaces to the IOMMU
- First statements of the body: read input parameters
Perform optimization on the original C code:
- Pipelining
- Unrolling
- (Manually) partition memory into multiple memory banks
- Goal: initiation interval = 1
- Minimize off-chip bandwidth (data reuse optimization)
Slide14
Hardware Design Automation Flow
[Figure: ARAPrototyper design automation flow. Labels:
- Inputs: user-designed ACCs in C; ARA specification file; platform name
- High-level synthesis: user-designed accelerators (HLS) become ACC designs in RTL
- Hardware templates: platform-independent modules (ARA memory system and interconnects, highly parameterized) and platform-specific modules
- ARA memory system optimization; module configurations; module instantiation
- System integration
- Platform backend flow (Xilinx PlanAhead flow): logic synthesis, mapping, placement and route
- Output: FPGA bitstream]
Slide15
APIs and System Software Stack
Slide16
User APIs (How to Program Accelerators)
- Include the header files of the accelerator definitions
- The header is generated from the ARA description file (XML format)
- User APIs: reserve(), check_reserved(), send_param(), check_done(), free()
- The APIs communicate with the software global accelerator manager (GAM)
Annotations on the example code:
- Header file: declares a class for each type of accelerator
- Declaration: use the acc object to manipulate the Gaussian accelerator in the ARA
1. reserve(): reserve the Gaussian accelerator (send a signal to GAM)
2. check_reserved(): wait until GAM confirms the reservation
3. send_param(): send parameters and start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array
4. check_done(): check the done signal from GAM
5. free(): free the Gaussian accelerator; GAM can then use it for other applications
Slide17
User APIs (Cont’d)
What’s behind run():
acc.run(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1);
expands to:
acc.reserve();
while (acc.check_reserved() == 0);
acc.send_param(7, Image::get_M(), Image::get_M(), Image::get_M(), a.get_ptr(), 1, 1, 1);
while (acc.check_done() == 0);
acc.free();
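The relationship between run() and the five user APIs can be tried out without hardware using a small mock. The C++ sketch below is a stand-in, not the class that ARAPrototyper generates: `MockAcc`, its `calls()` log, and the immediate completion of every step are assumptions made so that the call order becomes observable. The first argument of send_param() is carried along unchanged because the slide does not explain it.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a generated accelerator class (illustration only).
// Every request is granted and completed immediately.
class MockAcc {
public:
    void reserve() { calls_.push_back("reserve"); reserved_ = true; }
    int check_reserved() const { return reserved_ ? 1 : 0; }

    // Accepts the leading integer and any further parameters, as on the slide.
    template <typename... Args>
    void send_param(int /*first_arg*/, Args... /*params*/) {
        calls_.push_back("send_param");
        done_ = true;  // the mock "finishes" as soon as it is started
    }

    int check_done() const { return done_ ? 1 : 0; }
    void free() { calls_.push_back("free"); reserved_ = false; done_ = false; }

    // run() is the same five-step sequence shown on the slide.
    template <typename... Args>
    void run(int first_arg, Args... params) {
        reserve();
        while (check_reserved() == 0) {}
        send_param(first_arg, params...);
        while (check_done() == 0) {}
        free();
    }

    // State-changing calls in the order they were issued.
    const std::vector<std::string>& calls() const { return calls_; }

private:
    bool reserved_ = false;
    bool done_ = false;
    std::vector<std::string> calls_;
};
```

Calling `run()` on the mock records reserve, send_param, and free in that order, with the two polling loops in between.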
Steps performed by run():
1. reserve(): reserve the Gaussian accelerator (send a signal to GAM)
2. check_reserved(): wait until GAM confirms the reservation
3. send_param(): send parameters and start the Gaussian accelerator; M, P, N: sizes of the three dimensions; a: the input image array
4. check_done(): check the done signal from GAM
5. free(): free the Gaussian accelerator; GAM can then use it for other applications
Slide18
Underlying System Software Stack
Major components:
- GAM: global accelerator manager
- DBA: dynamic buffer allocator
- TLB miss handler
- Coherence manager
Slide19
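Of the software stack components listed under Underlying System Software Stack, the GAM is the one that the user APIs talk to: reserve() asks it for an accelerator and free() hands the accelerator back for other applications. The slides do not describe the GAM's internals, so the C++ sketch below is only a guess at the minimum bookkeeping such a manager needs. `SimpleGam` and all of its methods are hypothetical, and the model is single-threaded with no queueing or arbitration policy.

```cpp
#include <map>
#include <string>

// Hypothetical single-threaded model of accelerator bookkeeping
// (illustration only; not the real GAM).
class SimpleGam {
public:
    // Register how many instances of an accelerator type exist in the ARA.
    void add_type(const std::string& type, int instances) {
        pool_[type] = Pool{instances, instances};
    }

    // Grant a reservation if an instance of this type is idle.
    bool reserve(const std::string& type) {
        auto it = pool_.find(type);
        if (it == pool_.end() || it->second.idle == 0) return false;
        --it->second.idle;
        return true;
    }

    // Return an instance so that another application can reserve it.
    void release(const std::string& type) {
        auto it = pool_.find(type);
        if (it != pool_.end() && it->second.idle < it->second.total) {
            ++it->second.idle;
        }
    }

    // Number of currently idle instances of a type (0 if unknown).
    int idle(const std::string& type) const {
        auto it = pool_.find(type);
        return it == pool_.end() ? 0 : it->second.idle;
    }

private:
    struct Pool {
        int total;
        int idle;
    };
    std::map<std::string, Pool> pool_;
};
```

With two gradient instances registered, as in the example specification file, a third concurrent reservation is refused until one of the first two is released.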
Flow Efficiency & Case Study
Slide20
ARA Evaluation Runtime
One flow can be finished within four hours:
(1) ARAPrototyper flow runtime
(2) Xilinx tools (> 98%): Vivado HLS, PlanAhead (Map, P&R)
(3) Native execution on the prototype
2.9x ~ 42.6x evaluation time reduction compared to full-system simulations
Native execution on prototype vs. simulation: 10^5 ~ 10^6 difference
Slide21
Case Study: A Real ARA Prototype
Medical Imaging Applications
Heterogeneous Accelerators:
- Acc0: gradient
- Acc1: gaussian
- Acc2: rician
- Acc3: segmentation
Slide22
ARA Prototype vs. Xeon vs. ARM
- Measured from the real prototype and real machines
- 7.44x better energy efficiency over Intel Sandy-Bridge processor
- 2.22x better energy efficiency over ARM Cortex-A9
- 24x - 84x energy efficiency for ASIC projection

                    Cortex-A9    Xeon (24 threads, OpenMP)    ARA
Frequency           667 MHz      1.9 GHz                      ACCs@100MHz, CPU@667MHz
Runtime (seconds)   28.34        0.55                         4.53
Power               1.1W         190W (TDP)                   3.1W
Total Energy        2.22x        7.44x                        1x
Slide23
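The Total Energy row of the ARA Prototype vs. Xeon vs. ARM comparison follows from the Runtime and Power rows, since energy is power multiplied by runtime. The short C++ sketch below redoes that arithmetic with the values from the slide. The `Measurement` struct and function names are introduced here for illustration, and the Xeon figure uses the 190 W TDP exactly as the slide does, so it is an upper-bound style estimate, not a measured power.

```cpp
// Runtime and power of one platform, taken from the comparison table.
struct Measurement {
    double runtime_s;  // Runtime (seconds)
    double power_w;    // Power (watts)
};

// energy [J] = power [W] * runtime [s]
inline double energy_joules(const Measurement& m) {
    return m.runtime_s * m.power_w;
}

// Energy of a platform relative to a baseline platform.
inline double energy_vs(const Measurement& other, const Measurement& baseline) {
    return energy_joules(other) / energy_joules(baseline);
}

const Measurement kCortexA9{28.34, 1.1};
const Measurement kXeon{0.55, 190.0};  // power is TDP, as on the slide
const Measurement kAra{4.53, 3.1};
```

With these inputs, `energy_vs(kCortexA9, kAra)` comes out at about 2.22 and `energy_vs(kXeon, kAra)` at about 7.44, matching the table (roughly 31.2 J and 104.5 J against 14.0 J for the ARA).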
Use Case 1: Accelerator Microarchitecture Modification
Data reuse optimization
- Goal: II=1 and reduce the required off-chip bandwidth by storing data in the local memory banks
- 2.5x - 5.9x performance gain
Slide24
Use Case 2: Interconnect Resource Utilization
Use FPGA prototype to project resource utilization
Two partial crossbar synthesis methods (Acc #: 30)
Left-hand side: much more congested and consumes more resources
Slide25
Summary
An alternative for ARA evaluation:
- Collect the performance and energy data from real silicon
Goals of the ARAPrototyper:
- Reduce the design efforts
- Shorten the evaluation time
ARAPrototyper features:
- Highly automated ARA prototyping flow
- Reusable interconnect and memory system templates
- Automatically generated APIs
- System software stack (runtime)
Slide26
Thank You!
Zhenman is on the academic job market; please check his website:
https://sites.google.com/site/fangzhenman/