Presentation Transcript

Slide1

Analysis of Sparse Convolutional Neural Networks

Sabareesh Ganapathy

Manav Garg

Prasanna Venkatesh Srinivasan

Slide2

Convolutional Neural Network

State of the art in Image classification

Terminology – Feature Maps, Weights

Layers - Convolution, ReLU, Pooling, Fully Connected

Example - AlexNet (2012)

(Image taken from mdpi.com)

Slide3

Sparsity in Networks

Trained networks occupy a large amount of memory (~200 MB for AlexNet).

This leads to many DRAM accesses.

Filters can be sparse. The sparsity percentages for filters in the AlexNet convolutional layers are shown below.

This sparsity is too low: thresholding at inference time would lead to accuracy loss.

Sparsity must instead be accounted for during training.

Layer   Sparsity Percentage
CONV2   6
CONV3   7
CONV4   7

Slide4

Sparsity in Networks

Deep Compression

Prunes low-magnitude weight values, then retrains to recover accuracy.

Less storage, and a speedup is also reported in fully connected layers.
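The pruning step can be illustrated with a minimal NumPy sketch; the filter shape and threshold below are made up for illustration, and Deep Compression's actual pruning and retraining schedule is more involved:

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = 100.0 * np.count_nonzero(pruned == 0) / pruned.size
    return pruned, sparsity

# Illustrative filter bank; real pruning thresholds are tuned per layer.
w = np.random.randn(96, 363).astype(np.float32)
w_pruned, pct = prune_by_magnitude(w, threshold=0.5)
print(f"{pct:.1f}% of the weights are now zero")
# Deep Compression then retrains the network so the surviving weights
# compensate for the pruned ones.
```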

Structured Sparsity Learning (SSL) in Deep Neural Networks

Employs locality optimization for sparse weights (structured, group-wise sparsity rather than randomly scattered zeros).

Reports memory savings and speedups in convolutional layers.

Layer   Sparsity Percentage
CONV1   14
CONV2   54
CONV3   72
CONV4   68
CONV5   58

Slide5

Caffe Framework

Open-source framework to build and run convolutional neural nets.

Provides Python and C++ interface for inference.

Source code is in C++ and CUDA.

Employs efficient data structures for feature maps and weights (the Blob data structure).

Caffe Model Zoo - a repository of CNN models for analysis.

Pretrained models for the base and compressed versions of AlexNet are available in the Model Zoo.
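As a rough illustration of the Python inference interface, a minimal sketch is shown below; the file names and the 'data'/'prob' blob names follow the standard bvlc_alexnet deploy files and are assumptions, not part of the original slides:

```python
import numpy as np
import caffe

# Paths and blob names ('data', 'prob') assume the standard bvlc_alexnet
# deploy files from the Model Zoo; adjust for the model actually used.
caffe.set_mode_cpu()
net = caffe.Net("deploy.prototxt", "bvlc_alexnet.caffemodel", caffe.TEST)

image = np.random.rand(1, 3, 227, 227).astype(np.float32)  # placeholder input
net.blobs["data"].reshape(*image.shape)
net.blobs["data"].data[...] = image
out = net.forward()
print(out["prob"].argmax())  # index of the predicted class
```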

Slide6

Convolution = Matrix Multiply


IFM: 227x227x3

Filter: 11x11x3x96

Stride: 4

OFM: 55x55x96

IFM converted to 363x3025 matrix

Each filter looks at an 11x11x3 input volume, at 55 locations along W and H.

Weights converted to 96x363 matrix

OFM = Weights x IFM.

BLAS libraries are used to implement the matrix multiply (GEMM): MKL for the CPU, cuBLAS for the GPU.
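A minimal NumPy sketch of this im2col-style lowering is shown below, using the shapes quoted on this slide; the im2col helper is a naive loop for illustration, not Caffe's implementation, and the BLAS call is replaced by NumPy's matmul:

```python
import numpy as np

def im2col(ifm, k, stride):
    """Unroll a (C, H, W) input into a (C*k*k, out_h*out_w) matrix of patches."""
    C, H, W = ifm.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=ifm.dtype)
    col = 0
    for y in range(out_h):
        for x in range(out_w):
            patch = ifm[:, y * stride:y * stride + k, x * stride:x * stride + k]
            cols[:, col] = patch.reshape(-1)
            col += 1
    return cols

# Shapes from the slide: IFM 227x227x3, 96 filters of 11x11x3, stride 4.
ifm = np.random.rand(3, 227, 227).astype(np.float32)
weights = np.random.rand(96, 3 * 11 * 11).astype(np.float32)  # 96 x 363
cols = im2col(ifm, k=11, stride=4)                            # 363 x 3025
ofm = (weights @ cols).reshape(96, 55, 55)                    # GEMM -> 55x55x96 OFM
```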

Slide7

Sparse Matrix Multiply

Weight matrix can be represented in sparse format for sparse networks.

Compressed Sparse Row (CSR) format: the matrix is converted to three arrays that represent the non-zero values.

Array A - Contains non zero values.

Array JA - Column index of each element in A.

Array IA - Cumulative sum of number of non-zero values in previous rows.
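A small SciPy sketch of these three arrays, and of the sparse-times-dense (CSRMM-style) product described below, follows; the matrix values are made up, with the zeros standing in for pruned weights:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sparse weight matrix (zeros are pruned weights).
W = np.array([[0., 2., 0., 1.],
              [0., 0., 0., 0.],
              [3., 0., 4., 0.]], dtype=np.float32)

W_csr = csr_matrix(W)
print(W_csr.data)     # A  -> [2. 1. 3. 4.]   non-zero values
print(W_csr.indices)  # JA -> [1 3 0 2]       column index of each value
print(W_csr.indptr)   # IA -> [0 2 2 4]       running count of non-zeros per row

# CSRMM-style product: sparse weights x dense IFM matrix -> dense output.
ifm = np.random.rand(4, 6).astype(np.float32)
ofm = W_csr @ ifm     # dense (3, 6) result
```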

The sparse representation saves memory and could result in more efficient computation.

Wen-Wei - new Caffe branch for sparse convolution:

Represent convolutional layer weights in CSR format.

Uses sparse matrix multiply routines (CSRMM).

Weights are in sparse format, the IFM is dense, and the output is dense.

MKL library for the CPU, cuSPARSE for the GPU.

Slide8

Analysis Framework

gem5-gpu was initially planned as the simulation framework, but it proved very slow due to the large size of deep neural networks.

Analysis was instead performed by running Caffe and CUDA programs on native hardware.

For CPU analysis, an AWS system with 2 Intel Xeon cores running at 2.4 GHz was used.

For GPU analysis, a dodeca system with an NVIDIA GeForce GTX 1080 GPU was used.

Slide9

Convolutional Layer Analysis

Deep Compression and SSL trained networks were used for analysis. Both showed similar trends.

The memory savings obtained with the sparse representation are given below.

The time taken for the multiplication was recorded. Conversion time to CSR format was not included, as weights are sparsified only once for a set of IFMs.

Layer   Memory Saving (x)
CONV1   1.11
CONV2   2.48
CONV3   2.72
CONV4   2.53
CONV5   2.553

Slide10

CPU: CSRMM vs GEMM

CSRMM is slower than GEMM.

The overhead depends on the sparsity percentage.

Slide11

GPU: CSRMM vs GEMM

The CSRMM overhead is even larger on the GPU.

GPU operations are faster than their CPU counterparts, so the relative cost of the sparse routine is higher.

Slide12

Fully Connected Layer (FC) Analysis

Fully connected layers form the final layers of a typical CNN and are implemented as a matrix-vector multiply operation (GEMV).

Modified Caffe's internal data structures (Blobs) to represent the weights of FC layers in sparse format.

Sparse matrix-vector multiplication (SpMV) is used for the sparse computation.

Deep Compression Model used for analysis.

Image taken from petewarden.com
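A minimal SciPy sketch of the SpMV formulation is shown below; the 4096x9216 shape and the 10% density are assumptions chosen for illustration (roughly an AlexNet-sized FC layer after pruning), not measurements from the slides:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Hypothetical FC layer: 4096 outputs, 9216 inputs, ~90% of weights pruned.
W_sparse = sparse_random(4096, 9216, density=0.10, format="csr")
x = np.random.rand(9216)            # input activation vector

y_spmv = W_sparse @ x               # SpMV: sparse weights x dense vector
y_gemv = W_sparse.toarray() @ x     # equivalent dense GEMV
print(np.allclose(y_spmv, y_gemv))  # same result, far less weight storage
```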

Slide13

FC Layer Analysis

A speed-up of about 3x was observed on both the CPU and the GPU.

Slide14

Matrix Multiply Analysis

Custom C++ and CUDA programs were written to measure the time taken to execute only the matrix multiplication routines.

This allowed us to vary the sparsity of the weight matrix and find the break-even point where CSRMM becomes faster than GEMM.

The size of the weight matrix was chosen to equal that of the largest AlexNet CONV layer.

The zeros were distributed randomly in the weight matrix.
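A rough sketch of such a timing harness is given below; it uses NumPy/SciPy on the CPU with made-up matrix sizes and sparsity levels, whereas the actual experiments used custom C++/CUDA programs with MKL and cuSPARSE:

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random

def time_multiply(weights, ifm, repeats=10):
    """Best-of-N wall-clock time for weights @ ifm."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = weights @ ifm
        best = min(best, time.perf_counter() - t0)
    return best

# Illustrative sizes only, not the exact dimensions used in the experiments.
rows, inner, cols = 256, 1200, 3025
ifm = np.random.rand(inner, cols)

for sparsity in (0.5, 0.9, 0.99):
    w_sparse = sparse_random(rows, inner, density=1.0 - sparsity, format="csr")
    w_dense = w_sparse.toarray()
    print(f"sparsity {sparsity:.0%}: "
          f"GEMM {time_multiply(w_dense, ifm):.4f} s, "
          f"CSRMM {time_multiply(w_sparse, ifm):.4f} s")
```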

Slide15

Matrix Multiply Analysis

Slide16

GPU Memory Scaling Experiment

Motivation:

Since the sparse matrix representation occupies significantly less space than the dense representation, a larger working set can fit in the cache/memory system in the sparse case.

Implementation:

Since the weights matrix is the one passed in sparse format, we increased its size to a larger value.
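For intuition, the sketch below estimates the dense versus CSR storage footprint for a float32 weight matrix with 32-bit indices; the 10% density is an assumed example, and the measurements on the next slide include all GPU buffers, not just the weights:

```python
def footprint_mib(m, n, density):
    """Approximate storage for an m x n float32 matrix: dense vs. CSR (int32 indices)."""
    nnz = int(m * n * density)
    dense = 4 * m * n                      # 4 bytes per value
    csr = 4 * nnz + 4 * nnz + 4 * (m + 1)  # values + column indices + row pointers
    return dense / 2**20, csr / 2**20

# Weight-matrix shapes from the scaling experiment, at an assumed 10% density.
print(footprint_mib(256, 1200, 0.10))      # small weight matrix
print(footprint_mib(25600, 24000, 0.10))   # bumped-up weight matrix
```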

Slide17

GPU Results (GEMM vs CSRMM)

GEMM vs CSRMM (Weight Matrix = 256 x 1200)

Format   Total Memory Used (MB)   Runtime (ns)
Dense    3377                     60473
Sparse   141                      207910

GEMM vs CSRMM (Weight Matrix = 25600 x 24000)

Format   Total Memory Used (MB)   Runtime (ns)
Dense    5933                     96334
Sparse   1947                     184496

While GEMM is still faster in both cases, the (GEMM/CSRMM) time ratio increases from 0.29 to 0.52 as the weight matrix dimensions are bumped up.

Slide18

GPU: Sparse X Sparse

IFM sparsity is due to the ReLU activation. Sparsity in the CONV layers of AlexNet is given below.

The CUSP library was used in a custom program for the sparse x sparse multiply of IFM and weights.

A speedup of 4x was observed compared to the GEMM routine.

Memory savings of 4.2x compared to GEMM.

This could not yet be scaled to the typical dimensions of AlexNet; that work is in progress.

Layer   Sparsity Percentage
CONV2   23.57
CONV3   56.5
CONV4   64.7
CONV5   68.5
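A small SciPy sketch of the sparse x sparse formulation (both the pruned weights and the ReLU-sparsified IFM in CSR form) is shown below; the shapes and densities are illustrative, and the actual GPU experiments used the CUSP library rather than SciPy:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Both operands sparse: pruned weights and a ReLU-sparsified IFM matrix.
W = sparse_random(96, 363, density=0.35, format="csr")
ifm = sparse_random(363, 3025, density=0.40, format="csr")

ofm = W @ ifm                                 # sparse x sparse product, CSR result
dense_ref = W.toarray() @ ifm.toarray()       # dense GEMM reference
print(np.allclose(ofm.toarray(), dense_ref))  # same values, smaller operands
```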

Slide19

CONCLUSION

Representing matrices in sparse format results in significant memory savings, as expected.

We did not observe any practical computational benefits for convolutional layers using the library routines provided by MKL and cuSPARSE, on either the CPU or the GPU.

Fully connected layers showed around a 3x speedup when they have high sparsity.

With a larger dataset and more GPU memory, we might see a drop in convolutional runtime for the sparse representation.

Sparse x sparse computation showed promising results and will be implemented in Caffe.

Slide20

THANK YOU

QUESTIONS ?