Slide 1
Stanford University
Slide 2
CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks
Yuanfang Li and Ardavan Pedram*
Stanford University
Cerebras Systems
Slide 3
CATERPILLAR © A. Pedram
Slide 4
Deep Learning Stack
Stack: Example
End Application: Smart Camera Mobile App
High Level Service: Photo Recognition API Access
DNN Model and Data: Trained Photo CNN
Compute (Computation): Compute Infrastructure Required for Training
Slide 5
Research Efforts as of July 2017
Source: ScaleDeep, ISCA 2017
Slide 6
The Neural Networks Zoo
http://www.asimovinstitute.org/neural-network-zoo/ by Asimov Institute.
Slide 7
RNN, CNN, DFF (MLP)
Slide 8
Multilayer Perceptron
Several Fully Connected Layers
Basic Operation: Matrix Vector Multiplication (GEMV)
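To make the "several fully connected layers, one GEMV each" picture concrete, here is a minimal sketch of an MLP forward pass. The sigmoid activation, the omission of biases, and the tiny 3-2-1 network with made-up weights are assumptions for illustration; the slides do not specify them.

```python
import math

def gemv(W, x):
    """Matrix-vector product: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sigmoid(v):
    """Element-wise logistic function (an assumed activation)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def mlp_forward(weights, x):
    """Apply h = f(W h_prev) layer by layer; return all activations."""
    h = x
    activations = [h]
    for W in weights:
        h = sigmoid(gemv(W, h))  # one GEMV per fully connected layer
        activations.append(h)
    return activations

# Hypothetical 3-2-1 network with made-up weights.
weights = [
    [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]],  # layer 1: 2 x 3
    [[1.0, -1.0]],                          # layer 2: 1 x 2
]
acts = mlp_forward(weights, [1.0, 2.0, 3.0])
```

Each layer touches every weight exactly once per input vector, which is why the GEMV dominates the cost of a single-sample pass.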
Slide 9
Backpropagation
Training MLP: Backpropagation
Slide 10
Backpropagation
Basic Operation
GEMV: Update Gradient
Rank-1 update (outer product): Update Weights
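The two basic operations named here can be seen in a single-sample backpropagation step: a GEMV with the transposed weights carries the gradient back one layer, and a rank-1 (outer product) update changes the weights. The sketch below assumes ReLU hidden layers, a linear output layer, squared-error loss, and no biases; none of these choices come from the slides.

```python
def gemv(W, x):
    """y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gemv_transposed(W, d):
    """Compute W^T d without materializing the transpose."""
    return [sum(W[i][j] * d[i] for i in range(len(W)))
            for j in range(len(W[0]))]

def rank1_update(W, d, h, lr):
    """In place: W -= lr * outer(d, h)."""
    for i, d_i in enumerate(d):
        for j, h_j in enumerate(h):
            W[i][j] -= lr * d_i * h_j

def backprop_step(weights, x, target, lr):
    """One training step on one sample; returns the pre-update output."""
    # Forward pass: ReLU on hidden layers, linear output (assumed).
    acts = [x]
    for k, W in enumerate(weights):
        z = gemv(W, acts[-1])
        is_output = k == len(weights) - 1
        acts.append(z if is_output else [max(0.0, v) for v in z])
    # Output delta for the loss 0.5 * ||y - target||^2.
    delta = [y - t for y, t in zip(acts[-1], target)]
    for k in reversed(range(len(weights))):
        W = weights[k]
        # GEMV with the not-yet-updated weights: gradient w.r.t. layer input.
        back = gemv_transposed(W, delta)
        # Rank-1 update of this layer's weights.
        rank1_update(W, delta, acts[k], lr)
        if k > 0:
            # ReLU derivative: keep the gradient where the activation was positive.
            delta = [b if a > 0.0 else 0.0 for b, a in zip(back, acts[k])]
    return acts[-1]
```

Note that the transposed GEMV is computed before the rank-1 update so that the backward pass uses the same weights as the forward pass.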
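Slide 11
Gradient Descent
[Figure: timeline (time t) for an MLP with input x, hidden layers h1, h2, h3, and output ŷ. Sample 1 runs forward through h1, h2, h3, then its deltas δ3, δ2, δ1 run backward. Inset: GEMV drawn as C += A × B, with dimension labels m and n and a sum (∑) reduction.]
Slide 12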
Stochastic Gradient Descent
[Figure, panel (a): timeline (time t) for an MLP with input x, hidden layers h1, h2, h3, and output ŷ. Samples 1, 2, ..., i are processed one after another; each sample's forward activations h1, h2, h3 are followed by its backward deltas δ3, δ2, δ1.]
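With one sample in flight at a time, every layer is a GEMV, and spreading a GEMV over a grid of processing elements needs two communication patterns: the input vector must be broadcast to the PEs that hold the matching weight columns, and their partial sums must be reduced. The sketch below simulates this on a hypothetical `grid_rows` × `grid_cols` PE grid with a pairwise (tree) reduction. The block partitioning and the function names are assumptions for illustration, not the actual CATERPILLAR mapping.

```python
def ceil_div(a, b):
    return -(-a // b)

def tree_reduce(vectors):
    """Pairwise element-wise sum; takes about log2(len(vectors)) rounds."""
    while len(vectors) > 1:
        nxt = []
        for i in range(0, len(vectors) - 1, 2):
            nxt.append([a + b for a, b in zip(vectors[i], vectors[i + 1])])
        if len(vectors) % 2:
            nxt.append(vectors[-1])  # odd one out waits for the next round
        vectors = nxt
    return vectors[0]

def grid_gemv(W, x, grid_rows, grid_cols):
    """y = W x with W split into blocks, one block per simulated PE."""
    m, n = len(W), len(x)
    row_block = ceil_div(m, grid_rows)
    col_block = ceil_div(n, grid_cols)
    y = []
    for pi in range(grid_rows):
        r0, r1 = pi * row_block, min((pi + 1) * row_block, m)
        partials = []
        for pj in range(grid_cols):
            c0, c1 = pj * col_block, min((pj + 1) * col_block, n)
            # Broadcast: every PE in grid column pj sees the same slice of x.
            x_slice = x[c0:c1]
            partials.append([
                sum(W[i][c0 + k] * x_k for k, x_k in enumerate(x_slice))
                for i in range(r0, r1)
            ])
        # Reduction: partial results are summed across the PEs of grid row pi.
        y.extend(tree_reduce(partials))
    return y
```

Each weight is used once per input vector, so the communication steps are not amortized over much arithmetic.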
GEMV: Inherently Inefficient
Requirements:
Broadcast (systolic / non-systolic)
Reduction (systolic / tree based)
Slide 13
Batched Gradient Descent
[Figure: timelines (time t) for an MLP with input x, hidden layers h1, h2, h3, and output ŷ. Panel (a): samples 1, 2, ..., i processed one at a time, each with forward activations h1, h2, h3 followed by deltas δ3, δ2, δ1. Panel (b): samples grouped into batches 1:4 and 5:8, each batch with forward activations h1, h2, h3 followed by deltas δ3, δ2, δ1.]
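Grouping several samples into a batch (the figure uses batches 1:4 and 5:8) turns the per-sample GEMVs into a single GEMM, and turns the per-sample rank-1 updates into one matrix product. The sketch below checks both identities on small integer matrices, storing one sample per column; the helper names and the layout are assumptions for illustration.

```python
def gemv(W, x):
    """y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gemm(A, B):
    """C = A B for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward_batched(W, X):
    """X holds one sample per column; one GEMM replaces one GEMV per sample."""
    return gemm(W, X)

def batch_gradient(D, H):
    """Sum over samples of outer(d_s, h_s), computed as the GEMM D H^T.

    D holds one delta per column, H one input activation per column.
    """
    return gemm(D, transpose(H))

W = [[1, 2], [3, 4]]
X = [[1, 2, 3],
     [4, 5, 6]]           # three samples, two features each
Y = forward_batched(W, X)  # column s equals gemv(W, sample s)
```

The weights are read once per batch instead of once per sample, and the weight update is applied once per batch.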
Data Parallelism
GEMV ➔ GEMM
GEMM: Memory efficient kernel
Number of weight updates divided by the batch size
Slide 14
Direct Feedback Alignment
[Figure: timelines (time t) for an MLP with input x, hidden layers h1, h2, h3, and output ŷ. Panel (a): samples 1, 2, ..., i processed one at a time, each with forward activations h1, h2, h3 followed by deltas δ3, δ2, δ1. Panel (b): batches 1:4 and 5:8, each with forward activations followed by deltas δ3, δ2, δ1. Panel (c): direct feedback alignment on batches 1:4 and 5:8.]
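In direct feedback alignment, each hidden layer receives the output error through its own fixed random feedback matrix instead of through the transposed weights of the layers above it, so the hidden deltas no longer depend on each other. A minimal single-sample sketch follows, assuming tanh hidden layers, a linear output layer, squared-error loss, and no biases; the `feedback` argument (one fixed matrix per hidden layer) is a name chosen here for illustration.

```python
import math

def gemv(W, x):
    """y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dfa_step(weights, feedback, x, target, lr):
    """One direct feedback alignment update; returns the output error."""
    # Forward pass: tanh on hidden layers, linear output (assumed).
    acts = [x]
    for k, W in enumerate(weights):
        z = gemv(W, acts[-1])
        is_output = k == len(weights) - 1
        acts.append(z if is_output else [math.tanh(v) for v in z])
    error = [y - t for y, t in zip(acts[-1], target)]
    # Each hidden delta needs only the output error, so these GEMVs are
    # independent of one another and could run in parallel.
    deltas = []
    for k in range(len(weights) - 1):
        projected = gemv(feedback[k], error)  # fixed random matrix times error
        hidden = acts[k + 1]
        # tanh'(a) = 1 - tanh(a)^2, expressed with the stored activation.
        deltas.append([p * (1.0 - h * h) for p, h in zip(projected, hidden)])
    deltas.append(error)  # the output layer uses the true error
    # Rank-1 weight updates, one per layer.
    for W, delta, layer_input in zip(weights, deltas, acts):
        for i, d_i in enumerate(delta):
            for j, h_j in enumerate(layer_input):
                W[i][j] -= lr * d_i * h_j
    return error
```

The feedback matrices are never updated; only the forward weights change.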
Dependence Elimination
Parallelism in backward pass
Effective for smaller networks
Slide 15
Pipelined Continuous Propagation*
[Figure: timelines (time t) for an MLP with input x, hidden layers h1, h2, h3, and output ŷ. Panel (a): samples 1, 2, ..., i processed one at a time, each with forward activations h1, h2, h3 followed by deltas δ3, δ2, δ1. Panel (d): pipelined continuous propagation.]
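The layer-parallel idea of feeding a new input into the first layer while earlier inputs are still in later layers can be illustrated with a generic pipeline schedule. This sketch covers only a forward pipeline with one layer per stage; it does not model the backward pass or the weight-update rule of continuous propagation, which the slides do not spell out. The function name is hypothetical.

```python
def pipeline_schedule(num_layers, num_samples):
    """For each time step, list the sample index each layer works on.

    Layer l handles sample t - l at step t; None marks an idle layer
    while the pipeline fills or drains.
    """
    steps = []
    for t in range(num_samples + num_layers - 1):
        steps.append([
            t - layer if 0 <= t - layer < num_samples else None
            for layer in range(num_layers)
        ])
    return steps

# Three layers, two samples: the pipeline needs 2 + 3 - 1 = 4 steps.
for t, busy in enumerate(pipeline_schedule(3, 2)):
    print(t, busy)
```

Once the pipeline is full, every layer is busy on a different sample at each step, and each layer only ever touches its own weights.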
Layer Parallelization
Pipelining Inputs
Layer Locality
More Efficient GEMVs
Smaller Reduction Tree
Weight Temporal Locality
Update and Consume Immediately
*Continuous Propagation (CP), Cerebras Systems, patent pending
Slide 16
What Do We Need to Support?
GEMM
GEMV
Parallelization Between Cores
Collective Communications: Gather, Reduce, All Gather, All Reduce, Broadcast
Efficient Transpose
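As an illustration of the collectives listed here, an all-reduce on a ring of cores can be built from a reduce-scatter phase followed by an all-gather phase, which is the bucket-style scheme the deck returns to later. The simulation below is a sketch under assumptions: `p` nodes each hold `p` equally sized chunks, every node sends to its right-hand neighbor, and all sends in a step happen simultaneously. The function name and data layout are hypothetical.

```python
def ring_all_reduce(node_data):
    """Simulate a ring all-reduce (sum).

    node_data[r] is the list of p chunks held by node r, where p is the
    number of nodes. Returns the data held by each node afterwards.
    """
    p = len(node_data)
    data = [[list(chunk) for chunk in node] for node in node_data]

    # Phase 1, reduce-scatter: at step s node r sends chunk (r - s) mod p
    # to node r + 1, which adds it to its own copy. After p - 1 steps,
    # node r holds the complete sum of chunk (r + 1) mod p.
    for step in range(p - 1):
        sends = [((r - step) % p, list(data[r][(r - step) % p]))
                 for r in range(p)]  # snapshot: all sends are simultaneous
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % p
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Phase 2, all-gather: the finished chunks travel around the ring,
    # overwriting the stale partial sums.
    for step in range(p - 1):
        sends = [((r + 1 - step) % p, list(data[r][(r + 1 - step) % p]))
                 for r in range(p)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % p][c] = payload
    return data
```

Each node sends and receives only one chunk per step, so the per-link traffic stays constant as the ring grows, at the cost of 2(p - 1) steps.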
Slide 17
CATERPILLAR Architecture
[Figure: architecture diagram. Labels: grid positions 0,0, 0,1, ..., 0,15, ..., 15,0 (16 rows, 16 columns); panel c) with Core 0 to Core 7, each paired with an SRAM SPAD 0 to 7; links to/from the core in the same row and to/from the core in the same column.]
Slide 18
CATERPILLAR Architecture
PE: The Linear Algebra Core PE
Native Support for Inner Product
3 Levels of memory hierarchy: Accumulator, Mem B (2 ports), Mem A (1 port)
Distributed Memory Programming Model
State Machine Reprogrammable
Slide 19
CATERPILLAR Architecture
Core
GEMM: Optimized for Rank-1 Updates, Broadcast bus
GEMV: Systolic between neighboring PEs, Accelerate reduction
Slide 20
CATERPILLAR Architecture
Multicore
Ring of Cores
Reconfigurable
Support for Collective Comms: All gather, Reduce, All Reduce
Systolic/Parallel
Slide 21
CATERPILLAR Architecture
Multicore
Ring of Cores
Reconfigurable
Support for Collective Comms: All gather, Reduce, All Reduce
Systolic/Parallel
Slide 22
GEMV
Slide 23
Forward Path
[Figure: forward path. Labels: Input Activation (from previous layer), Broadcast, Current Layer's weights (Core 1 Partition, Core 2 Partition), Reduce (Reduce1, Reduce2), Output Activation, Transpose & send in time (to next layer).]
Slide 24
Delta Path
[Figure: delta path. Labels: Input delta (back from next layer), Broadcast, Core 1 Partition, Core 2 Partition, Reduce, Transpose, Output delta (to previous layer).]
Slide 25
Multicore GEMM
[Figure: multicore GEMM. Labels: Batched Samples, On Chip, Off Core memory distribution, All gather, Go to Next Layer.]
Slide 26
Multicore GEMM
[Figure: multicore GEMM. Labels: Batched Samples, On Chip, Off Core memory distribution, All Reduce.]
Slide 27
The Bucket Algorithm
Source: Robert van de Geijn
Slides 28-37
The Bucket Algorithm (continued)
Source: Robert van de Geijn
Slide 38
Methodology
Networks, Dataset, and Algorithms:
MNIST
Batch Sizes: 2, 4, 8, 50, 100
#Layers: 4, 5, 6
Deep & Wide Network: 2500-2000-1500-1000-500-10
Architecture:
Half Precision FPU
16 KB of local memory
512 KB private SRAM/Core
45 nm @ 1 GHz
2×16 cores with 16×16 PEs: 103.2 mm²
2×4 cores with 4×4 PEs: 178.9 mm²
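To get a feel for the size of the deep and wide network, the weight count and its half-precision footprint can be worked out directly. The sketch assumes that the listed sizes are the layers after a 784-element MNIST input (28 × 28 pixels) and that biases are ignored; neither assumption is stated on the slide.

```python
def mlp_weight_count(layer_sizes):
    """Number of weights in a fully connected network (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

MNIST_INPUTS = 28 * 28   # 784 pixels per image (assumed input width)
BYTES_PER_WEIGHT = 2     # half precision

deep_wide = [MNIST_INPUTS, 2500, 2000, 1500, 1000, 500, 10]
num_weights = mlp_weight_count(deep_wide)        # 11,965,000
weight_bytes = num_weights * BYTES_PER_WEIGHT    # 23,930,000 bytes, about 22.8 MiB

print(num_weights, weight_bytes, round(weight_bytes / 2**20, 1))
```

Whether this footprint fits on chip depends on how the local memories and the per-core SRAM are shared between weights, activations, and deltas, which this calculation does not attempt to model.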
Slide 39
Pure Convergence Analyses
CP: Cerebras Systems patent pending
Slide 40
Pure Convergence Analyses
Slide 41
Pure Convergence Analyses
Slide 42
Hardware Analyses
Combine epochs to convergence with hardware results:
Energy to Convergence
Time to Convergence
Network size: fits / does not fit on the cores
Bigger networks converge faster but need more compute
Batched algorithms use GEMM (faster) but converge slower
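Combining the two kinds of results amounts to multiplying the number of epochs an algorithm needs by the hardware cost of one epoch. The sketch below shows that arithmetic under the assumption of a constant average power; every number in it is made up for illustration and none comes from the presentation.

```python
def to_convergence(epochs, time_per_epoch_s, avg_power_w):
    """Return (time in seconds, energy in joules) to reach convergence.

    Assumes every epoch costs the same time and that power is constant.
    """
    time_s = epochs * time_per_epoch_s
    energy_j = time_s * avg_power_w
    return time_s, energy_j

# Made-up numbers: a per-sample algorithm that needs fewer epochs but runs
# slow GEMV epochs, versus a batched one with more, faster GEMM epochs.
per_sample = to_convergence(epochs=10, time_per_epoch_s=6.0, avg_power_w=2.0)
batched = to_convergence(epochs=25, time_per_epoch_s=2.0, avg_power_w=2.5)
print(per_sample, batched)
```

With these invented inputs the batched run finishes sooner but spends slightly more energy, which shows why time and energy to convergence have to be evaluated separately.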
Slide 43
Energy to Convergence: 32 4×4 Cores
500-500-500-10: Fits on Cores
2500-2000-1500-1000-500-10: Does not fit on Cores
Slide 44
Time to Accuracy
Going off-core is expensive
Minibatched algorithms converge faster than non-minibatched ones if the network does not fit
Slide 45
Conclusion
Training MLP DNNs and Their Effect on Convergence
Exploration of the Design Space of Accelerators for Various BP algorithms
CATERPILLAR: Both GEMV and GEMM Kernels, Collective Communications
If Network Fits: pipelined backpropagation consistently performs the best
If Network Does not Fit: minibatched algorithms have comparable performance to pipelined backpropagation
Slide 46