Stanford University - PowerPoint Presentation

Uploaded On 2018-01-21

CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks, by Yuanfang Li and Ardavan Pedram (Stanford University, Cerebras Systems).


Presentation Transcript

Slide1

Stanford University

Slide2

CATERPILLAR: CGRA for Accelerating the Training of Deep Neural Networks

Yuanfang Li and Ardavan Pedram*

Stanford University

Cerebras Systems

Slide3

CATERPILLAR

© A. Pedram

Slide4

Deep Learning Stack

Stack: Example

End Application: Smart Camera Mobile App

High Level Service: Photo Recognition API Access

DNN Model and Data: Trained Photo CNN

Computation: Compute Infrastructure Required for Training

Slide5

Research Efforts as of July 2017

Source: ScaleDeep, ISCA 2017

Slide6
The Neural Networks Zoo

http://www.asimovinstitute.org/neural-network-zoo/ by Asimov Institute.

Slide7
RNN, CNN, DFF (MLP)

Slide8
Multilayer Perceptron

Several Fully Connected Layers

Basic Operation: Matrix Vector Multiplication (GEMV)
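As an illustration of why GEMV is the basic operation of a multilayer perceptron, the sketch below runs an input vector through a stack of fully connected layers, one matrix-vector product per layer. The sigmoid activation, the helper names, and the tiny 2-3-1 network are assumptions made for the example; the slides do not specify them.

```python
import math

def gemv(W, x):
    """Matrix-vector product y = W x, with W stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, x):
    """Run x through every fully connected layer and return all activations."""
    activations = [x]
    for W in weights:
        x = [sigmoid(z) for z in gemv(W, x)]  # one GEMV per layer
        activations.append(x)
    return activations

# Hypothetical 2-3-1 network, chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
W2 = [[1.0, 1.0, 1.0]]
acts = forward([W1, W2], [1.0, 1.0])
print(acts[-1])
```

Each layer touches every weight exactly once per input sample, which is the property the later slides return to when they compare GEMV with GEMM.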

Slide9

Backpropagation

Training MLP

Slide10

Backpropagation

Basic Operation

GEMV: Update Gradient

Rank-1 update (outer product): Update Weights
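The two backward-pass operations named on this slide can be sketched for a single layer as follows. This is a minimal sketch under assumptions the slides do not state: sigmoid activations, plain gradient descent with learning rate `lr`, and hypothetical helper names. `delta_out` is the error at this layer's output and `h_in` is the activation that fed the layer.

```python
def gemv_transposed(W, delta):
    """Return W^T delta: propagates the error back through the layer (a GEMV)."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * delta[i] for i in range(rows)) for j in range(cols)]

def rank1_update(W, delta, h, lr):
    """In-place weight update W -= lr * outer(delta, h) (a rank-1 update)."""
    for i, d in enumerate(delta):
        for j, hj in enumerate(h):
            W[i][j] -= lr * d * hj

def backward_layer(W, delta_out, h_in, lr):
    """One layer of backpropagation, assuming h_in came out of a sigmoid."""
    back = gemv_transposed(W, delta_out)  # uses the weights before the update
    delta_in = [b * h * (1.0 - h) for b, h in zip(back, h_in)]
    rank1_update(W, delta_out, h_in, lr)
    return delta_in

W = [[1.0, 2.0]]
delta_in = backward_layer(W, [1.0], [0.5, 0.5], lr=1.0)
print(delta_in, W)
```

The error for the previous layer is computed before the weights change, so the update does not disturb the gradient that is still being propagated.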

Slide11

Gradient Descent

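[Figure: timeline over time t of the forward activations h1, h2, h3 and the backward deltas δ3, δ2, δ1 for sample 1, between input x and output ŷ, next to a GEMV diagram C += A × B with dimensions m and n.]

GEMV

Slide12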

Stochastic Gradient Descent

[Figure (a): timeline over time t of the forward activations h1, h2, h3 and the backward deltas δ3, δ2, δ1 between input x and output ŷ, repeated for samples 1, 2, ..., i.]

GEMV: Inherently Inefficient

Requirements

Broadcast (systolic / non-systolic)

Reduction (systolic / tree-based)

Slide13

Batched Gradient Descent

[Figure: (a) per-sample timeline of activations h1, h2, h3 and deltas δ3, δ2, δ1 for samples 1, 2, ..., i; (b) minibatched timeline with the samples grouped as 1:4 and 5:8.]

Data Parallelism

GEMV ➔ GEMM

GEMM: Memory efficient kernel

#weight updates / batch size

Slide14

Direct Feedback Alignment

[Figure: (a) per-sample timeline for samples 1, 2, ..., i; (b) minibatched timeline with the samples grouped as 1:4 and 5:8; (c) direct feedback alignment timeline showing h1, h2, h3 for batches 1:4 and 5:8 and δ3 for batch 5:8.]

Dependence Elimination

Parallelism in backward pass

Effective for smaller networks

Slide15

Pipelined Continuous Propagation*

[Figure: (a) per-sample timeline for samples 1, 2, ..., i; (d) pipelined continuous propagation timeline, of which only the forward activations h1, h2, h3 of sample 1 survive in this transcript.]

Layer Parallelization

Pipelining Inputs

Layer Locality

More Efficient GEMVs

Smaller Reduction Tree

Weight Temporal Locality

Update and Consume Immediately
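The "Layer Parallelization" and "Pipelining Inputs" points can be illustrated with a generic forward-only pipeline schedule, in which layer l works on sample t - l at time step t. This is only a sketch of ordinary layer pipelining with a hypothetical function name; it does not model the backward pass or the weight-update rule of Continuous Propagation, which this transcript does not describe.

```python
def pipeline_schedule(num_layers, num_samples):
    """Forward-only pipeline: at step t, layer l processes sample t - l (0-based).

    Returns one dict per time step mapping busy layer -> sample index.
    """
    steps = []
    for t in range(num_samples + num_layers - 1):
        busy = {}
        for layer in range(num_layers):
            sample = t - layer
            if 0 <= sample < num_samples:
                busy[layer] = sample
        steps.append(busy)
    return steps

schedule = pipeline_schedule(num_layers=3, num_samples=4)
for t, busy in enumerate(schedule):
    print(t, busy)
```

With 3 layers and 4 samples the pipeline finishes in 6 steps, whereas running each sample through all layers before starting the next would take 12 layer-steps.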

*Continuous Propagation (CP): Cerebras Systems, patent pending

Slide16
What Do We Need to Support?

GEMM

GEMV

Parallelization Between Cores

Collective Communications: Gather, Reduce, All Gather, All Reduce, Broadcast

Efficient Transpose
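For readers unfamiliar with the collective communications listed here, the sketch below gives their reference semantics, with a Python list standing in for the set of cores. The function names and the choice of elementwise sum as the reduction operator are assumptions for illustration; the sketch says nothing about how CATERPILLAR implements these operations in hardware.

```python
def broadcast(vec, num_cores):
    """Every core receives a copy of the root's vector."""
    return [list(vec) for _ in range(num_cores)]

def gather(per_core):
    """The root receives the concatenation of every core's piece."""
    return [x for piece in per_core for x in piece]

def all_gather(per_core):
    """Every core receives the concatenation of every core's piece."""
    full = gather(per_core)
    return [list(full) for _ in per_core]

def reduce_to_root(per_core):
    """The root receives the elementwise sum of every core's vector."""
    return [sum(vals) for vals in zip(*per_core)]

def all_reduce(per_core):
    """Every core receives the elementwise sum of every core's vector."""
    total = reduce_to_root(per_core)
    return [list(total) for _ in per_core]

cores = [[1, 2], [3, 4], [5, 6]]
print(reduce_to_root(cores), gather(cores))
```

The forward path needs the partial dot products of a GEMV summed across cores, which is a reduce, and the result made available to the cores holding the next layer, which is a broadcast.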

Slide17
CATERPILLAR Architecture

[Figure: CATERPILLAR architecture. A grid of 16 rows by 16 columns with elements labeled 0,0, 0,1, ..., 0,15, ..., 15,0, with links to/from the core in the same column and to/from the core in the same row; panel (c) shows Cores 0 to 3 together with SRAM SPAD blocks 0 to 7.]

Slide18
CATERPILLAR Architecture

[Figure: CATERPILLAR architecture diagram (16 × 16 grid, Cores 0 to 3, SRAM SPAD blocks 0 to 7).]

PE

Native Support for Inner Product

3 Levels of memory hierarchy

Accumulator

Mem B (2 ports)

Mem A (1 port)

Distributed Memory Programming Model

State Machine Reprogrammable

The Linear Algebra Core PE

Slide19
CATERPILLAR Architecture

[Figure: CATERPILLAR architecture diagram (16 × 16 grid, Cores 0 to 3, SRAM SPAD blocks 0 to 7).]

Core

GEMM

Optimized for Rank-1 Updates

Broadcast bus

GEMV

Systolic between neighboring PEs

Accelerate reduction

Slide20
CATERPILLAR Architecture

[Figure: CATERPILLAR architecture diagram (16 × 16 grid, Cores 0 to 3, SRAM SPAD blocks 0 to 7).]

Multicore

Ring of Cores

Reconfigurable

Support for Collective Comms

All gather

Reduce

All Reduce

Systolic/Parallel

Slide21
CATERPILLAR Architecture

[Repeats the content of Slide20.]

Slide22
GEMV

Slide23
Forward Path

[Figure: forward path with the current layer's weights split into a Core 1 partition and a Core 2 partition. Labels: Input Activation, From previous layer, Broadcast, Reduce, Reduce1, Reduce2, Transpose & send in time, Output Activation, to next layer.]

Slide24
Delta Path

[Figure: delta path across the Core 1 partition and the Core 2 partition. Labels: Output delta, Input delta, Back From Next Layer, To Previous layer, Broadcast, Reduce, Transpose.]

Slide25
Multicore GEMM

[Figure: multicore GEMM (×) with All gather. Labels: Off Core memory distribution, Batched Samples, On Chip, Go to Next Layer.]

Slide26
Multicore GEMM

[Figure: multicore GEMM (×) with All Reduce. Labels: Batched Samples, On Chip, Off Core memory distribution.]

Slide27

Source: Robert van de Geijn

The Bucket Algorithm

Slide28
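The figures for the bucket algorithm did not survive in this transcript. As a rough guide, the sketch below simulates the commonly described ring ("bucket") all-reduce: a reduce-scatter phase in which chunks travel around the ring and accumulate, followed by an all-gather phase that circulates the finished chunks. That this matches the slides step for step is an assumption, and the function name and list-based "cores" are hypothetical.

```python
def ring_all_reduce(vectors):
    """Simulate a ring all-reduce (sum) over p cores, each holding one vector."""
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "sketch assumes the vector splits evenly into p chunks"
    size = n // p
    data = [list(v) for v in vectors]

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after p - 1 steps core r owns the full sum of chunk (r + 1) % p.
    for step in range(p - 1):
        # Snapshot every outgoing message first so all sends happen "simultaneously".
        msgs = [((r - step) % p, None) for r in range(p)]
        msgs = [(c, [data[r][i] for i in chunk(c)]) for r, (c, _) in enumerate(msgs)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % p
            for k, i in enumerate(chunk(c)):
                data[dst][i] += payload[k]

    # Phase 2, all-gather: finished chunks travel around the ring and overwrite stale ones.
    for step in range(p - 1):
        msgs = [((r + 1 - step) % p, None) for r in range(p)]
        msgs = [(c, [data[r][i] for i in chunk(c)]) for r, (c, _) in enumerate(msgs)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % p
            for k, i in enumerate(chunk(c)):
                data[dst][i] = payload[k]
    return data

result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each core sends only one chunk per step to its ring neighbor, so the traffic per link stays the same as the number of cores grows, which suits a ring of cores.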

[Slide28 through Slide37 repeat "The Bucket Algorithm" (Source: Robert van de Geijn) as successive animation steps; the figures are not recoverable from this transcript.]

Slide38
Methodology

Networks, Dataset, and Algorithms:

MNIST

Batch Sizes: 2, 4, 8, 50, 100

#Layers: 4, 5, 6

Deep & Wide Network: 2500-2000-1500-1000-500-10

Architecture:

Half Precision FPU

16 KB of local memory

512 KB private SRAM/Core

45 nm @ 1 GHz

2×16 cores with 16×16 PEs: 103.2 mm²

2×4 cores with 4×4 PEs: 178.9 mm²
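A back-of-the-envelope check of the memory figures above helps explain why the later slides distinguish networks that fit on the cores from those that do not. The sketch counts weights only, at 2 bytes each for half precision, against 512 KB of private SRAM per core. Several things here are assumptions: that 32 cores are used, that the listed layer sizes follow a 784-pixel MNIST input layer, and that biases, activations and gradients can be ignored. The 500-500-500-10 network is the smaller one named later in the deck.

```python
BYTES_PER_WEIGHT = 2          # half precision
SRAM_PER_CORE = 512 * 1024    # 512 KB private SRAM per core
NUM_CORES = 32                # assumption: 2 x 16 cores
MNIST_INPUTS = 784            # 28 x 28 pixel images

def weight_count(layer_sizes):
    """Number of weights in a chain of fully connected layers (no biases)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def fits_on_cores(layers_after_input):
    """Return (fits, bytes_needed) for the weights alone."""
    sizes = [MNIST_INPUTS] + layers_after_input
    needed = weight_count(sizes) * BYTES_PER_WEIGHT
    return needed <= NUM_CORES * SRAM_PER_CORE, needed

print(fits_on_cores([500, 500, 500, 10]))
print(fits_on_cores([2500, 2000, 1500, 1000, 500, 10]))
```

Under these assumptions the small network needs under 2 MB of the 16 MB total, while the deep and wide network needs roughly 24 MB, which is in line with the deck describing it as not fitting on the cores.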

Slide39
Pure Convergence Analyses

Slide40
Pure Convergence Analyses

Slide41

Pure Convergence Analyses

Slide42
Hardware Analyses

Combine Epoch to convergence with hardware

Energy to Convergence

Time to Convergence

Network size: Fit / don’t fit on the cores

Bigger Network: Converge Faster, Need More compute

Batched Algorithms: Use GEMM, Faster, Converge Slower
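One way to see why batched algorithms that use GEMM run faster per sample is to count arithmetic per weight fetched. The sketch below uses an idealized cost model, which is an assumption and not a measurement from the slides: a multiply-add counts as 2 flops, each weight of an m × n layer is read once per batch, and activation traffic is ignored.

```python
def layer_cost(m, n, batch_size):
    """Return (flops, weight_reads, flops_per_weight_read) for one m x n layer.

    batch_size = 1 corresponds to GEMV; batch_size > 1 corresponds to GEMM.
    """
    flops = 2 * m * n * batch_size
    weight_reads = m * n
    return flops, weight_reads, flops / weight_reads

for b in (1, 2, 4, 8, 50, 100):
    print(b, layer_cost(500, 500, b))
```

In this model the work done per weight read grows linearly with the batch size, while the number of weight updates per epoch shrinks by the same factor, which is the trade-off behind "Faster" and "Converge Slower".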

Slide43
Energy to Convergence 32 4x4 Cores

500-500-500-10: Fits on Cores

2500-2000-1500-1000-500-10: Does not fit on Cores

Slide44
Time to Accuracy

Going off-core is Expensive

Minibatched algorithms converge faster than non-minibatched ones if the network does not fit

Slide45
Conclusion

Training MLP DNNs and Their Effect on Convergence

Exploration of the Design Space of Accelerators for Various BP algorithms

CATERPILLAR

Both GEMV and GEMM Kernels

Collective Communications

If Network Fits: pipelined backpropagation consistently performs the best

If Network Does not Fit: minibatched algorithms have comparable performance to pipelined backpropagation

Slide46

CATERPILLAR
