New hardware architectures for efficient deep net processing

Presentation Transcript

1. New hardware architectures for efficient deep net processing

2. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
9 authors @ NVIDIA, MIT, Berkeley, Stanford
ISCA 2017

3. Convolution operation
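
Not on the slide: a minimal sketch of the dense convolution loop nest being discussed, assuming unit stride, no padding, and the usual dimension names (K output channels, C input channels, an H x W input plane, R x S filters).

```python
import numpy as np

def conv2d(inputs, weights):
    # inputs: C x H x W activations; weights: K filters of shape C x R x S.
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):                      # output channels
        for c in range(C):                  # input channels
            for y in range(H - R + 1):      # output rows
                for x in range(W - S + 1):  # output cols
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter cols
                            out[k, y, x] += weights[k, c, r, s] * inputs[c, y + r, x + s]
    return out
```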

4. Reuse

5. Memory: size vs. access energy

6. Dataflow decides reuse
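
One way to read this slide: reordering the same loop nest changes which operand is fetched once and reused from local storage. A sketch of a weight-stationary ordering (illustrative, not the paper's nest); the input-stationary counterpart appears after the next slide.

```python
import numpy as np

def conv_weight_stationary(inputs, weights):
    # Weight-stationary: each weight is fetched once and held while the
    # inner loops sweep every output position that uses it.
    K, C, R, S = weights.shape
    _, H, W = inputs.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    w = weights[k, c, r, s]          # the stationary operand
                    for y in range(H - R + 1):
                        for x in range(W - S + 1):
                            out[k, y, x] += w * inputs[c, y + r, x + s]
    return out
```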

7. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
N = 1 (batch size) for inference
Reuse activations: Input Stationary (IS)
Reuse filters
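
A sketch of Input Stationary in loop-nest terms, under the same illustrative assumptions as above: each activation is fetched once and held while every weight that touches it streams past.

```python
import numpy as np

def conv_input_stationary(inputs, weights):
    # IS: each input activation is the stationary operand; partial sums
    # that fall outside the output plane are skipped.
    K, C, R, S = weights.shape
    _, H, W = inputs.shape
    Hout, Wout = H - R + 1, W - S + 1
    out = np.zeros((K, Hout, Wout))
    for c in range(C):
        for y in range(H):
            for x in range(W):
                a = inputs[c, y, x]                  # the stationary operand
                for k in range(K):
                    for r in range(R):
                        for s in range(S):
                            oy, ox = y - r, x - s    # output coordinate
                            if 0 <= oy < Hout and 0 <= ox < Wout:
                                out[k, oy, ox] += weights[k, c, r, s] * a
    return out
```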

8. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Inter-PE parallelism
Intra-PE parallelism
Output coordinate

9. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Intra-PE parallelism
Cartesian Product (CP): all-to-all multiplications
Each PE contains F*I multipliers
Vector of F filter weights fetched
Vector of I input activations fetched
Multiplier outputs sent to accumulator to compute partial sums
Accumulator has F*I adders to match multiplier throughput
Partial sum written at matching coordinate in output activation space
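
A software sketch of one CP step as described above; the double loop stands in for the F*I hardware multipliers, the coordinate arithmetic assumes unit stride, and all names are illustrative.

```python
def cp_step(weight_vec, weight_coords, act_vec, act_coords, acc):
    # weight_vec: F filter weights; weight_coords: their (k, r, s) positions.
    # act_vec: I input activations; act_coords: their (y, x) positions.
    # acc: K x Hout x Wout accumulation buffer (e.g., a NumPy array).
    for w, (k, r, s) in zip(weight_vec, weight_coords):
        for a, (y, x) in zip(act_vec, act_coords):
            oy, ox = y - r, x - s                # matching output coordinate
            if 0 <= oy < acc.shape[1] and 0 <= ox < acc.shape[2]:
                acc[k, oy, ox] += w * a          # one of the F*I partial sums
```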

10. Inter-PE parallelism

11. Last week: Inter-PE parallelism

12. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Planar tiled (PT): division of the input activation map across PEs (inter-PE parallelism)
Each PE processes C*Wt*Ht inputs
Output halos: the accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs
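
A sketch of the planar tiling and the resulting halo widths, assuming unit stride; the bookkeeping is illustrative, not SCNN's implementation.

```python
def plan_tiles(H, W, Ht, Wt, R, S):
    # Split the H x W input plane into Ht x Wt tiles, one per PE. Outputs
    # near a tile edge depend on inputs the PE does not own, so the PE's
    # accumulation buffer holds incomplete partial sums in a halo that is
    # communicated to neighboring PEs (widths R-1 and S-1 for unit stride).
    tiles = []
    for y0 in range(0, H, Ht):
        for x0 in range(0, W, Wt):
            tiles.append({
                "inputs": (y0, min(y0 + Ht, H), x0, min(x0 + Wt, W)),
                "halo_rows": R - 1,   # partial-sum rows exchanged vertically
                "halo_cols": S - 1,   # partial-sum cols exchanged horizontally
            })
    return tiles
```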

13. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Inter-PE parallelism
Intra-PE parallelism
Output coordinate

14. Last week: sparsity in FC layer
Weights might be 0
Activations might be 0
Static sparsity of the network, created by pruning during training
Dynamic sparsity (input dependent), created by the ReLU operator on specific activations during inference
Non-zero weights
Non-zero activations
Table of shared weights
4-bit index to shared weights
Weight quantization and sharing
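
A sketch of weight quantization and sharing with a 16-entry table, so each stored index fits in 4 bits. The published scheme clusters weights (e.g., with k-means); quantile centers are used here only to keep the sketch short.

```python
import numpy as np

def quantize_and_share(weights, table_size=16):
    # Replace each nonzero weight with an index into a small table of
    # shared values; only the table and the 4-bit indices are stored.
    nonzero = weights[weights != 0]
    table = np.quantile(nonzero, np.linspace(0, 1, table_size))
    indices = np.abs(nonzero[:, None] - table[None, :]).argmin(axis=1)
    return table, indices.astype(np.uint8)    # each index fits in 4 bits
```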

15. Weight sparsity (statically, from pruning) and activation sparsity (dynamically, from ReLU) are present in conv layers too.

18. Last week: Compressed format to store non-zero weights
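
A sketch of a run-length style compressed format of this kind: only nonzero values are stored, each paired with the count of zeros skipped since the previous nonzero. Field widths and the exact layout are assumptions, not last week's format verbatim.

```python
def compress(dense):
    # E.g., [0, 3, 0, 0, 5, 7] -> values [3, 5, 7], zeros_before [1, 2, 0].
    values, zeros_before, run = [], [], 0
    for v in dense:
        if v == 0:
            run += 1                  # extend the current run of zeros
        else:
            values.append(v)          # store only the nonzero value...
            zeros_before.append(run)  # ...plus the zeros skipped before it
            run = 0
    return values, zeros_before
```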

19. Example compressed weights for PE0

20. Similar compression of zero weights and activations
Not all weights (F) and input activations (I) are stored and fetched in the dataflow.
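
Putting the two ideas together: because both operand vectors are drawn from compressed, zero-free streams, every one of the F*I products has two nonzero operands. A sketch (decode_positions matches the compress() sketch above; both helpers are illustrative):

```python
def decode_positions(zeros_before):
    # Recover each nonzero's dense position from its zeros_before count.
    pos, positions = -1, []
    for z in zeros_before:
        pos += z + 1
        positions.append(pos)
    return positions

def sparse_products(nz_weights, nz_acts):
    # All F*I products have nonzero operands: no fetch or multiplier
    # cycle is spent on a zero weight or a zero activation.
    return [w * a for w in nz_weights for a in nz_acts]
```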

21. PE hardware architecture

22. Other details in paper
Implementation
Evaluation

23. Next
Friday: student presentations
Tuesday: minor 1
3-4 more lectures on architecture