Analysis of Sparse Convolutional Neural Networks
Sabareesh Ganapathy
Manav Garg
Prasanna Venkatesh Srinivasan
Convolutional Neural Network
State of the art in Image classification
Terminology – Feature Maps, Weights
Layers - Convolution, ReLU, Pooling, Fully Connected
Example - AlexNet (2012)
(Image taken from mdpi.com)
Sparsity in Networks
Trained networks occupy huge memory (~200 MB for AlexNet), leading to many DRAM accesses.
Filters can be sparse. The sparsity percentage for filters in AlexNet's convolutional layers is shown below.
This sparsity is too low, and thresholding at inference would lead to accuracy loss.
Sparsity should instead be accounted for during training.
Layer    Sparsity Percentage
CONV2    6
CONV3    7
CONV4    7
Sparsity in Networks
Deep Compression
Prunes low-weight values, then retrains to recover accuracy (a minimal pruning sketch follows the table below).
Less storage, and a speedup reported for Fully Connected layers.
Structured Sparsity Learning (SSL) in Deep Neural Networks
Employs locality optimization for sparse weights.
Memory savings and speedup in Convolutional layers.
Layer    Sparsity Percentage
CONV1    14
CONV2    54
CONV3    72
CONV4    68
CONV5    58
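As a rough illustration, here is a minimal magnitude-pruning sketch in NumPy, in the spirit of Deep Compression; the threshold value and weight shape are assumptions made for the example, and the retraining step is only indicated by the mask.

import numpy as np

def prune(weights, threshold):
    # Zero out weights whose magnitude falls below the threshold, and keep a
    # mask so a retraining pass can hold pruned positions at zero.
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = 0.05 * np.random.randn(96, 363).astype(np.float32)   # toy CONV1-sized weights
w_pruned, mask = prune(w, threshold=0.05)
print(f"sparsity after pruning: {100.0 * (w_pruned == 0).mean():.1f}%")
# During retraining, gradients at pruned positions would be masked: grad *= mask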
Caffe Framework
Open-source framework to build and run convolutional neural nets.
Provides Python and C++ interface for inference.
Source code in C++ and CUDA.
Employs efficient data structures for feature maps and weights (the Blob data structure).
Caffe Model Zoo - repository of CNN models for analysis.
Pretrained models for the base and compressed versions of AlexNet are available in the Model Zoo.
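As a quick sketch of the Python interface, the snippet below loads a pretrained model and inspects its Blobs; the prototxt/caffemodel file names are placeholders, not files produced by this analysis.

import caffe

caffe.set_mode_cpu()
# Network definition and pretrained weights (file names are assumptions).
net = caffe.Net('deploy.prototxt', 'bvlc_alexnet.caffemodel', caffe.TEST)

# Blobs hold the feature maps flowing through the net; params hold the weights and biases.
for name, blob in net.blobs.items():
    print('feature map', name, blob.data.shape)
for name, params in net.params.items():
    print('weights', name, params[0].data.shape)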
Convolution = Matrix Multiply
IFM: 227x227x3
Filter: 11x11x3x96
Stride: 4
OFM: 55x55x96
IFM converted to a 363x3025 matrix: each filter sees an 11x11x3 input volume at 55 locations along W and 55 along H.
Weights converted to a 96x363 matrix.
OFM = Weights x IFM.
BLAS libraries used to implement the matrix multiply (GEMM): MKL for CPU, cuBLAS for GPU.
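A minimal NumPy sketch of this lowering (im2col followed by a GEMM) for the shapes above; it illustrates the idea rather than Caffe's actual implementation, and assumes channel-first layout with no padding.

import numpy as np

def im2col(ifm, k, stride):
    # Turn each k x k x C input patch into one column of the lowered matrix.
    c, h, w = ifm.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.zeros((c * k * k, out_h * out_w), dtype=ifm.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = ifm[:, i * stride:i * stride + k, j * stride:j * stride + k]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

ifm = np.random.rand(3, 227, 227).astype(np.float32)
weights = np.random.rand(96, 3, 11, 11).astype(np.float32)

cols = im2col(ifm, k=11, stride=4)     # 363 x 3025
w_mat = weights.reshape(96, -1)        # 96 x 363
ofm = w_mat @ cols                     # 96 x 3025, reshaped back to 96 x 55 x 55
print(cols.shape, w_mat.shape, ofm.reshape(96, 55, 55).shape)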
Sparse Matrix Multiply
Weight matrix can be represented in sparse format for sparse networks.
Compressed Sparse Row Format. Matrix converted to arrays to represent non-zero values.
Array A - Contains non zero values.
Array JA - Column index of each element in A.
Array IA - Cumulative sum of number of non-zero values in previous rows.
Sparse representation saves memory and could result in more efficient computation.
Wen-Wei – new Caffe branch for sparse convolution:
Represents convolutional layer weights in CSR format.
Uses sparse matrix multiply routines (CSRMM).
Weights in sparse format, IFM dense, output dense.
MKL library for CPU, cuSPARSE for GPU.
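A small SciPy illustration of the CSR arrays and of a CSRMM-style multiply (sparse weights times a dense IFM matrix); the toy matrix and shapes are made up for the example.

import numpy as np
from scipy import sparse

# Toy weight matrix with explicit zeros.
W = np.array([[0., 2., 0., 1.],
              [0., 0., 0., 0.],
              [3., 0., 4., 0.]])

W_csr = sparse.csr_matrix(W)
print(W_csr.data)     # A  -> [2. 1. 3. 4.]   non-zero values
print(W_csr.indices)  # JA -> [1 3 0 2]       column index of each value
print(W_csr.indptr)   # IA -> [0 2 2 4]       cumulative non-zeros per row

# CSRMM-style product: sparse weights x dense IFM -> dense output.
ifm = np.random.rand(4, 5).astype(np.float32)
out = W_csr @ ifm
assert np.allclose(out, W @ ifm)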
Analysis Framework
gem5-gpu was initially planned as the simulation framework, but gem5 proved far too slow given the large size of deep neural networks.
Analysis was instead performed by running Caffe and CUDA programs on native hardware.
For CPU analysis, an AWS system with 2 Intel Xeon cores running at 2.4 GHz was used.
For GPU analysis, the dodeca system with an NVIDIA GeForce GTX 1080 GPU was used.
Convolutional Layer Analysis
Deep Compression and SSL trained networks were used for analysis. Both showed similar trends.
The memory savings obtained with the sparse representation are given below (a rough sizing sketch follows the table).
The time taken for the multiplication was recorded. Conversion time to CSR format was not included, as the weights are sparsified only once per set of IFMs.
Layer    Memory Saving (x)
CONV1    1.11
CONV2    2.48
CONV3    2.72
CONV4    2.53
CONV5    2.553
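A back-of-the-envelope estimate of the dense-vs-CSR footprint for one layer's weight matrix; the CONV3-like shape (384 x 2304), float32 values, int32 indices, and the injected sparsity level are all assumptions, so the ratio will not reproduce the measured numbers above exactly.

import numpy as np

def csr_saving(w):
    # Dense: one float32 per entry. CSR: one float32 per non-zero (A),
    # one int32 per non-zero (JA), and one int32 per row + 1 (IA).
    dense_bytes = w.size * 4
    nnz = np.count_nonzero(w)
    csr_bytes = nnz * 4 + nnz * 4 + (w.shape[0] + 1) * 4
    return dense_bytes / csr_bytes

w = np.random.randn(384, 2304).astype(np.float32)   # CONV3-like lowered weights
w[np.abs(w) < 1.0] = 0.0                             # roughly 68% zeros, for illustration
print(f"estimated memory saving: {csr_saving(w):.2f}x")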
CPU - CSRMM vs GEMM
CSRMM is slower than GEMM.
The overhead depends on the sparsity percentage.
GPU - CSRMM vs GEMM
The CSRMM overhead is even larger on the GPU.
GPU operations are faster than their CPU counterparts.
Fully Connected Layer (FC) Analysis
Fully Connected layers form the final layers of a typical CNN and are implemented as a matrix-vector multiply (GEMV).
Modified Caffe's internal data structure (Blob) to represent the FC layer weights in sparse format.
Sparse Matrix-Vector Multiplication (SpMV) used for the sparse computation.
Deep Compression model used for analysis.
(Image taken from petewarden.com)
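A minimal SpMV sketch in SciPy: a CSR weight matrix times a dense activation vector. The fc6-like shape (4096 x 9216) and the sparsity level are assumptions for illustration.

import numpy as np
from scipy import sparse

W = np.random.randn(4096, 9216).astype(np.float32)
W[np.abs(W) < 1.5] = 0.0                 # ~87% zeros, mimicking a pruned FC layer
x = np.random.rand(9216).astype(np.float32)

W_csr = sparse.csr_matrix(W)
y_gemv = W @ x                           # dense matrix-vector multiply (GEMV)
y_spmv = W_csr @ x                       # sparse matrix-vector multiply (SpMV)
assert np.allclose(y_gemv, y_spmv, atol=1e-2)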
FC Layer Analysis
Speed-up of 3x observed for both CPU and GPU.
Matrix Multiply Analysis
Custom C++ and CUDA programs were written to measure the time taken to execute only the matrix multiplication routines.
This allowed us to vary the sparsity of the weight matrix and find the break-even point at which CSRMM becomes faster than GEMM.
The size of the weight matrix was chosen to match that of the largest AlexNet CONV layer.
The zeros were distributed randomly in the weight matrix.
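The custom programs themselves are not reproduced here; the SciPy sketch below shows the shape of the experiment on CPU, assuming a 384 x 2304 weight matrix (a CONV3-like lowering) and a 169-column IFM matrix.

import time
import numpy as np
from scipy import sparse

M, K, N = 384, 2304, 169                      # weights (M x K) times lowered IFM (K x N)
ifm = np.random.rand(K, N).astype(np.float32)

for sparsity in (0.5, 0.7, 0.9, 0.95, 0.99):
    w = np.random.randn(M, K).astype(np.float32)
    w[np.random.rand(M, K) < sparsity] = 0.0  # zeros placed at random positions
    w_csr = sparse.csr_matrix(w)

    t0 = time.perf_counter(); w @ ifm;     t_gemm = time.perf_counter() - t0
    t0 = time.perf_counter(); w_csr @ ifm; t_csrmm = time.perf_counter() - t0
    print(f"sparsity {sparsity:.2f}: GEMM {t_gemm * 1e3:.3f} ms, CSRMM {t_csrmm * 1e3:.3f} ms")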
Matrix Multiply Analysis
(Results plot)
GPU Memory Scaling Experiment
Motivation: Since the sparse matrix representation occupies significantly less space than the dense one, a larger working set can fit in the cache/memory system in the sparse case.
Implementation: As the “Weights” matrix was the one passed in sparse format, we increased its size to a much larger value.
GPU Results (GEMM vs CSRMM)
GEMM vs CSRMM (Weight Matrix = 256 x 1200)

Format    Total Memory Used (MB)    Runtime (ns)
Dense     3377                      60473
Sparse    141                       207910

GEMM vs CSRMM (Weight Matrix = 25600 x 24000)

Format    Total Memory Used (MB)    Runtime (ns)
Dense     5933                      96334
Sparse    1947                      184496

While GEMM is still faster in both cases, the GEMM/CSRMM time ratio increases from 0.29 to 0.52 as the weight matrix dimensions are bumped up.
GPU: Sparse X Sparse
IFM sparsity arises from the ReLU activation. The IFM sparsity in the CONV layers of AlexNet is given below.
CUSP library used in a custom program for the sparse x sparse multiply of IFM and weights.
A speedup of 4x observed compared to the GEMM routine.
Memory savings of 4.2x compared to GEMM.
Could not yet scale to the typical dimensions of AlexNet; this is in progress.
Layer    Sparsity Percentage
CONV2    23.57
CONV3    56.5
CONV4    64.7
CONV5    68.5
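The CUSP program is not reproduced here; the SciPy sketch below shows the same sparse x sparse pattern, with both the ReLU-sparsified IFM matrix and the pruned weight matrix held in CSR. Shapes and sparsity levels are illustrative assumptions.

import numpy as np
from scipy import sparse

w = np.random.randn(384, 2304).astype(np.float32)
w[np.random.rand(*w.shape) < 0.7] = 0.0        # pruned weights (~70% zeros)
ifm = np.random.rand(2304, 169).astype(np.float32)
ifm[np.random.rand(*ifm.shape) < 0.6] = 0.0    # zeros introduced by ReLU (~60%)

w_csr, ifm_csr = sparse.csr_matrix(w), sparse.csr_matrix(ifm)
ofm = w_csr @ ifm_csr                          # sparse x sparse -> sparse (CSR) result
print(type(ofm).__name__, ofm.nnz, "non-zeros out of", ofm.shape[0] * ofm.shape[1])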
CONCLUSION
Representing matrices in sparse format results in significant memory savings, as expected.
We did not observe any practical computational benefit for Convolutional layers using the library routines provided by MKL and cuSPARSE, on either CPU or GPU.
Fully Connected layers showed around a 3x speedup for layers with high sparsity.
With a larger dataset and more GPU memory, we might see a drop in convolutional runtime for the sparse representation.
Sparse x sparse computation showed promising results and will be implemented in Caffe.
THANK YOU
QUESTIONS ?