SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks. 9 authors @ NVIDIA, MIT, Berkeley, Stanford. ISCA 2017.
1. New hardware architectures for efficient deep net processing
2. SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks
9 authors @ NVIDIA, MIT, Berkeley, Stanford
ISCA 2017
3. Convolution operation
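The slide's figure is lost in this transcript. As a reference for the operation the rest of the deck optimizes, a conv layer can be sketched as a direct loop nest (a minimal NumPy sketch with assumed names: K output channels, C input channels, R x S filters, stride 1, no padding):

```python
import numpy as np

def conv2d(inputs, weights):
    """Direct convolution over a C x W x H input with K filters of
    shape C x R x S (stride 1, no padding)."""
    C, W, H = inputs.shape
    K, C2, R, S = weights.shape
    assert C == C2
    out = np.zeros((K, W - R + 1, H - S + 1))
    for k in range(K):                          # output channels
        for c in range(C):                      # input channels
            for x in range(W - R + 1):          # output plane
                for y in range(H - S + 1):
                    for r in range(R):          # filter window
                        for s in range(S):
                            out[k, x, y] += inputs[c, x + r, y + s] * weights[k, c, r, s]
    return out
```

The six nested loops are exactly what gives accelerator designers freedom: reordering and tiling them changes which operands stay resident, which is the "dataflow decides reuse" point made below.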
4. Reuse
5. Memory: size vs. access energy
6. Dataflow decides reuse
7. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
N = 1 for inference
Reuse activations: Input Stationary (IS)
Reuse filters
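The Input Stationary idea can be illustrated with a toy schedule (hypothetical helper, not code from the paper): a vector of activations is fetched into the PE once and held resident while the weights of every filter stream past it, so each activation is read from the buffer once but used many times.

```python
def input_stationary_schedule(acts, weight_groups):
    """Input-stationary reuse sketch: activations are fetched once,
    then every group of filter weights streams past the resident acts.
    Returns (activation fetch count, products in schedule order)."""
    acts_resident = list(acts)        # fetched into the PE once, then held
    act_fetches = len(acts_resident)
    products = []
    for weights in weight_groups:     # one vector of weights per filter
        for a in acts_resident:       # reuse: no re-fetch of activations
            for w in weights:
                products.append(a * w)
    return act_fetches, products
```

With F filters, each activation participates in F sets of products while being fetched only once, which is the reuse the slide names.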
8. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Inter-PE parallelism
Intra-PE parallelism
Output coordinate
9. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Intra-PE parallelism
Cartesian Product (CP): all-to-all multiplications
Each PE contains F*I multipliers
Vector of F filter weights fetched
Vector of I input activations fetched
Multiplier outputs sent to accumulator to compute partial sums
Accumulator has F*I adders to match multiplier throughput
Partial sum written at matching coordinate in output activation space
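The Cartesian-product step can be sketched as follows (a simplified stand-in, not the paper's exact coordinate computation; assumes stride 1 and a valid output region). Because only non-zeros are fetched, each weight and activation carries its coordinates, and the F*I products are scattered into the output space:

```python
def cartesian_product_mults(weights, acts, out_W, out_H):
    """All-to-all multiply of F weights and I activations (one PE step).
    weights: list of (k, r, s, value) -- output channel k, filter offset (r, s)
    acts:    list of (x, y, value)    -- input activation at (x, y)
    Returns a dict of partial sums keyed by output coordinate (k, ox, oy)."""
    psums = {}
    for (k, r, s, w) in weights:
        for (x, y, a) in acts:
            ox, oy = x - r, y - s            # output coordinate of this product
            if 0 <= ox < out_W and 0 <= oy < out_H:
                psums[(k, ox, oy)] = psums.get((k, ox, oy), 0.0) + w * a
    return psums
```

Every weight meets every activation, so all F*I multipliers do useful work even though both operand vectors are sparse-compressed; the cost moves to the accumulator, which must handle scattered output coordinates.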
10. Inter PE parallelism
11. Last week: Inter PE parallelism
12. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Planar Tiled (PT): division of the input activation map for inter-PE parallelism
Each PE processes C*Wt*Ht inputs
Output halos: accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs
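The output-halo effect can be illustrated with a small routing helper (hypothetical names and simplified stride-1 coordinate math, not SCNN's exact scheme): a product whose output coordinate falls outside the PE's own output tile is an incomplete partial sum that must be forwarded to the neighboring PE that owns that coordinate.

```python
def route_psum(x, y, r, s, x0, x1, y0, y1):
    """Decide where the partial sum from activation (x, y) and filter
    offset (r, s) lands, relative to this PE's output tile
    [x0, x1) x [y0, y1). Returns 'local' or the neighbor direction
    that must receive the halo partial sum."""
    ox, oy = x - r, y - s                     # output coordinate of the product
    if x0 <= ox < x1 and y0 <= oy < y1:
        return "local"
    dx = "west" if ox < x0 else ("east" if ox >= x1 else "")
    dy = "north" if oy < y0 else ("south" if oy >= y1 else "")
    return dy + dx
```

Activations within R-1 (or S-1) of a tile edge generate such cross-tile products, which is why the accumulation buffers hold incomplete sums at the tile borders.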
13. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Inter-PE parallelism
Intra-PE parallelism
Output coordinate
14. Last week: sparsity in FC layer
Weights might be 0; activations might be 0
Static sparsity of the network, created by pruning during training
Dynamic sparsity (input dependent), created by the ReLU operator on specific activations during inference
Non-zero weights; non-zero activations
Table of shared weights; 4-bit index to shared weights
Weight quantization and sharing
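The 4-bit weight-sharing idea can be sketched as follows (a simplified illustration with a uniform codebook; the Deep Compression work this recaps actually derives the shared values by k-means clustering): each weight is replaced by a 4-bit index into a table of 16 shared values.

```python
import numpy as np

def quantize_shared(weights, n_clusters=16):
    """Weight-sharing sketch: map each weight to the nearest of
    n_clusters shared values, storing only a small index per weight.
    Uses a uniform codebook for simplicity (assumption, not the
    k-means codebook of the original method)."""
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_clusters)          # 16 shared weights
    idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), codebook               # 4-bit indices + table
```

Storage per weight drops from 32 bits to 4, and the shared table can be fine-tuned afterwards to recover accuracy.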
15. Weight sparsity (statically from pruning) and activation sparsity (dynamically from ReLU) present in conv layers too.
18. Last week: Compressed format to store non-zero weights
19. Example: compressed weights for PE0
20. Similar compression of zero weights and activations
Not all weights (F) and input activations (I) are stored and fetched in the dataflow.
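The zero-skipping storage can be sketched in the spirit of the compressed format above (hypothetical helper names): only non-zeros are stored, each tagged with the number of zeros preceding it, so coordinates can be reconstructed on the fly as the PE walks the compressed stream.

```python
def compress(values):
    """Store only non-zeros, each paired with the run of zeros
    before it (a relative index)."""
    data, zrun = [], 0
    for v in values:
        if v == 0:
            zrun += 1
        else:
            data.append((zrun, v))
            zrun = 0
    return data

def decompress(data, length):
    """Rebuild the dense vector by replaying the zero runs."""
    out, pos = [0] * length, 0
    for zrun, v in data:
        pos += zrun
        out[pos] = v
        pos += 1
    return out
```

This is why the dataflow never fetches the full F weights or I activations: the compressed stream already contains only the operands that contribute to the output.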
21. PE hardware architecture
22. Other details in the paper
Implementation
Evaluation
23. Next
Friday: student presentations
Tuesday: minor 1
3-4 more lectures on architecture