Towards Efficient Learning for Visual and Sequential Data








Presentation Transcript

1. Towards Efficient Learning for Visual and Sequential Data
Sachin Mehta

2. Outline

3. Convolutional Neural Networks

4. Discrete convolution
A discrete convolution is a linear transformation:
Sparse – only a few inputs contribute to a given output unit
Reuses parameters – the same kernel is applied over multiple input elements
Figure: A 3x3 kernel (k0…k8) is applied to a 3x3 input patch (x0…x8) to produce one output element (y4); each output element is computed from 9 pixels.
Figure: The kernel strides over the input.
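To make the sliding-kernel picture concrete, here is a minimal "valid" (no padding, stride 1) 2-D convolution in plain Python; the function name `conv2d` and the toy data are ours, not from the slides:

```python
def conv2d(inp, kernel):
    """Valid 2-D cross-correlation (what deep-learning frameworks call
    "convolution"): slide the kernel over the input, no padding, stride 1."""
    k = len(kernel)
    out_h = len(inp) - k + 1
    out_w = len(inp[0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Each output element is a weighted sum of a k x k input patch
            out[i][j] = sum(inp[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out
```

With a 3x3 kernel over a 5x5 input, this produces a 3x3 output, matching the "sparse" property above: each output value depends on only 9 input pixels.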

5. Convolutional Neural Network (CNN)
A standard CNN for image classification is composed of:
Convolutional layers
Down-sampling layers: strided convolution, max pooling, avg. pooling
Batch normalization – see references for more details
Activation functions (e.g., ReLU) – see references for more details

6. Convolutional layer

7. Convolution Layer
A convolution layer takes an input feature map of dimension N x H x W and produces an output feature map of dimension M x H' x W'.
Each layer is defined by the following parameters:
# Input channels (N)
# Output channels (M)
Kernel size (k)
Padding (p)
Stride (s)

8. Convolution Layer
Figure: A 5x5 input convolved with a 3x3 kernel, stride = 1 and padding = 1, produces a 5x5 output.
Figure: A 5x5 input convolved with a 3x3 kernel, stride = 2 and padding = 1, produces a 3x3 output.
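Both figure examples follow the standard output-size formula, out = floor((n + 2p - k) / s) + 1, which we can sketch as a one-liner (the function name is ours):

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution over an n x n input:
    floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1
```

For a 5x5 input and 3x3 kernel: stride 1, padding 1 gives 5; stride 2, padding 1 gives 3, as in the figures.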

9. Convolution Layer
A convolution layer takes an N x H x W input feature map and produces an M x H' x W' output feature map.
Each layer is defined by the following parameters:
# Input channels (N)
# Output channels (M)
Kernel size (k)
Padding (p)
Stride (s)
# of parameters learned by the convolution layer: N x M x k x k (plus M bias terms)
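A standard convolution layer with N input channels, M output channels, and a k x k kernel learns N·M·k² weights, plus M bias terms. A quick sketch (the function name is ours):

```python
def conv_params(n_in, n_out, k, bias=True):
    """Learnable parameters of a standard conv layer:
    n_out * n_in * k * k weights, plus n_out biases if bias=True."""
    return n_out * n_in * k * k + (n_out if bias else 0)
```

For example, a 3-channel input, 16 output channels, and a 3x3 kernel gives 3 * 16 * 9 = 432 weights (448 with biases).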

10. Dilated Convolution Layer
Inserts spaces between the kernel elements to increase the effective size of the kernel
Same as the convolution layer except for one additional parameter, the dilation rate, which controls the spacing
Each layer is defined by the following parameters:
# Input channels (N)
# Output channels (M)
Kernel size (k)
Padding (p)
Stride (s)
Dilation rate (r)
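With dilation rate r, a k x k kernel still has k² weights but covers a larger window: k + (k - 1)(r - 1) pixels per side. A sketch of this relationship (the function name is ours):

```python
def effective_kernel_size(k, r):
    """Effective side length of a k x k kernel with dilation rate r:
    r - 1 zeros are inserted between adjacent kernel elements."""
    return k + (k - 1) * (r - 1)
```

So a 3x3 kernel with r = 2 behaves like a sparse 5x5 kernel, and with r = 4 like a sparse 9x9 kernel, at no extra parameter cost.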

11. Group Convolution Layer
The input and the kernel are split into g groups across the channel dimension
Each group then performs its convolution independently
Each layer is defined by the following parameters:
# Input channels (N)
# Output channels (M)
Kernel size (k)
Padding (p)
Stride (s)
Dilation rate (r)
# of groups (g)
Parameter reduction??

12. Group vs. Standard Convolution Layer
Figure: Standard convolution
Figure: Grouped convolution

13. Depth-wise Convolution
A special case of group convolution where each channel is processed independently
# input channels = # groups = # output channels
Parameter reduction??
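To answer the "parameter reduction??" prompts: each of the g groups maps N/g input channels to M/g output channels, so the weight count drops from N·M·k² to N·M·k²/g, and the depth-wise case (g = N = M) leaves only N·k² weights. A sketch, ignoring biases (the function name is ours):

```python
def group_conv_params(n_in, n_out, k, groups=1):
    """Weights of a grouped conv layer: each of the `groups` groups maps
    n_in/groups channels to n_out/groups channels with a k x k kernel."""
    assert n_in % groups == 0 and n_out % groups == 0
    return groups * (n_in // groups) * (n_out // groups) * k * k
```

For 32 input and 32 output channels with a 3x3 kernel: 9,216 weights as a standard convolution, 2,304 with 4 groups (a 4x reduction), and 288 depth-wise.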

14. Down-sampling

15. Down-sampling
Learning representations at multiple scales is a fundamental step in computer vision:
Laplacian pyramids
SIFT, etc.
Down-sampling in CNNs:
Strided convolution
Max pooling
Avg. pooling
Figure: A 4x4 input (a0…d3) down-sampled with max pooling, average pooling, and strided convolution.
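The pooling variants in the figure can be sketched in plain Python as non-overlapping window reductions (the function name and toy data are ours):

```python
def pool2d(inp, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride = size).
    mode="max" keeps the largest value per window; "avg" keeps the mean."""
    out = []
    for i in range(0, len(inp), size):
        row = []
        for j in range(0, len(inp[0]), size):
            window = [inp[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out
```

A 4x4 input pooled with a 2x2 window yields a 2x2 output, halving each spatial dimension, as in the figure.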

16. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
In collaboration with Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi
Source code is available at: https://sacmehta.github.io/ESPNet/
ECCV'18, MICCAI'18

17. ESP Block
ESP is the basic building block of ESPNet
The standard convolution is replaced by:
A point-wise convolution
A spatial pyramid of dilated convolutions
Figure: ESP kernel-level visualization
Figure: ESP block-level visualization

18. Gridding problem with Dilated Convolutions
Figure: Gridding artifact in dilated convolutions

19. Gridding problem with Dilated Convolutions
Solution: add convolution layers with a lower dilation rate at the end of the network (see the sources below for more details)
Cons: the number of network parameters increases
Sources:
Yu, Fisher, Vladlen Koltun, and Thomas Funkhouser. "Dilated residual networks." CVPR, 2017.
Wang, Panqu, et al. "Understanding convolution for semantic segmentation." WACV, 2018.

20. Hierarchical feature fusion for de-gridding
Figure: ESP block with Hierarchical Feature Fusion (HFF)

21. Hierarchical feature fusion (HFF) for de-gridding
Figure: Feature map visualization with and without HFF
Figure: ESP block with HFF

22. Comparison with efficient networks

23. Network size vs. Accuracy
Network size is the amount of space required to store the network parameters.
Under similar constraints, ESPNet outperforms MobileNet and ShuffleNet by about 6%.

24. Inference Speed vs. Accuracy
Inference speed is measured in frames processed per second.
Device: laptop GPU with 640 CUDA cores
Under similar constraints, ESPNet outperforms MobileNet and ShuffleNet by about 6%.

25. Comparison with state-of-the-art networks

26. Accuracy vs. Network size
Network size is the amount of space required to store the network parameters.
ESPNet is small and well suited for edge devices.

27. Accuracy vs. Network parameters
ESPNet learns fewer parameters while delivering competitive accuracy.

28. Power Consumption vs. Inference Speed
Figure: Standard GPU (NVIDIA TitanX: 3,500+ CUDA cores)
Figure: Mobile GPU (NVIDIA GTX 960M: 640 CUDA cores)
ESPNet is fast and consumes less power while delivering good segmentation accuracy.

29. Inference Speed and Power Consumption on an Embedded Device (NVIDIA TX2)
Figure: Inference speed at different GPU frequencies
Figure: Power consumption vs. number of samples
ESPNet processes an RGB image of size 1024x512 at a frame rate of 9 FPS.

30. Qualitative Results (Cityscapes Dataset)

31. Recurrent Neural Networks (RNNs)

32. RNN
RNNs process sequential data.
Figure: In this example, a neural network A takes an input x_t and produces an output h_t.

33. LSTM: Long Short-Term Memory network
LSTMs are a widely used form of RNN
An LSTM has gates that enable it to remove information from, or add information to, the cell state

34. LSTM cell state
The cell state is like a conveyor belt that runs down the entire chain, with only minor interactions with the outputs of the different gates

35. LSTM gates
Forget gate: throws away information in the cell state that is no longer required

36. LSTM gates
What do we want to store in the cell state? Two parts:
Sigmoid part (input gate layer): identifies which values to update
tanh part (context gate layer): generates new candidate values to be added to the cell state

37. LSTM gates
Updating the cell state:
Forget: multiply the forget gate's output with the previous cell state
Update: add the candidate values to the cell state

38. LSTM gates
The output is based on the cell state:
Sigmoid layer: decides which parts of the cell state to output
tanh function: scales the cell state between -1 and 1
Multiply the outputs of the sigmoid layer and the tanh function to emit only the parts selected by the sigmoid layer
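The four gate slides above can be collected into one step function. Below is a toy single-unit LSTM step in plain Python with scalar states; the function name and the weight-dictionary keys (wf, uf, bf, …) are ours, a minimal sketch of the standard gate equations rather than any particular implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for a single scalar unit (toy weights in dict w)."""
    # Forget gate: how much of the previous cell state to keep
    f = sigmoid(w["wf"] * h_prev + w["uf"] * x + w["bf"])
    # Input gate: which candidate values to write
    i = sigmoid(w["wi"] * h_prev + w["ui"] * x + w["bi"])
    # Candidate values (the tanh part)
    g = math.tanh(w["wg"] * h_prev + w["ug"] * x + w["bg"])
    # Cell state update: forget, then add the gated candidates
    c = f * c_prev + i * g
    # Output gate: which part of the squashed cell state to emit
    o = sigmoid(w["wo"] * h_prev + w["uo"] * x + w["bo"])
    h = o * math.tanh(c)
    return h, c
```

The returned (h, c) pair becomes (h_prev, c_prev) for the next time step, which is exactly the conveyor-belt picture above.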

39. Pyramidal Recurrent Unit for Language Modeling
Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, and Hannaneh Hajishirzi
Code available at https://github.com/sacmehta/PRU
EMNLP'18

40. LSTM

41. But they aren’t that great…

42. Pyramidal Recurrent Unit (PRU)
Captures coarse- and fine-grained information in different embedding spaces through sub-sampling
Makes better, more confident decisions
Utilizes high-dimensional representations without overfitting

43. Language Modeling
Figure: Given the input "We love eating", the model assigns a probability to each candidate next word (tofu, facebook, actually, thus, swim, …).

44. Language Modeling
Figure: Given the input "We love eating", the model assigns a probability to each candidate next word (tofu, facebook, actually, thus, swim, hamentaschen, …).

45. Language Modeling
Figure: Given the input "We love eating", the model assigns a probability to each candidate next word (tofu, facebook, actually, thus, swim, hamentaschen, …).
Perplexity: how confused the model is by the next word (lower is better)
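Perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next words; a minimal sketch (the function name is ours):

```python
import math

def perplexity(probs):
    """Perplexity over a sequence, given the probability the model assigned
    to each actual next word: exp(mean negative log-likelihood)."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)
```

A model that always assigns probability 1/V to the right word has perplexity V, i.e. it is "as confused as" a uniform guess over V words; lower is better.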

46. PRU for Language Modeling
Improves perplexity
Learns fewer parameters
Converges faster

47. Pyramidal transformation
Pyramid: sub-sample the input at different scales (average pooling)
Transform: learn scale-specific representations
Residual connection: improves gradient flow
Figure: LSTM vs. PRU

48. Grouped linear transformation
Group: split the input into smaller groups
Transform: learn group-specific representations
Merge: concatenate the group outputs to produce the final output
Figure: PRU vs. LSTM

49. Pyramidal transformation
Pyramid: sub-sample the input at different scales (average pooling)
Transform: learn scale-specific representations
Residual connection: prior to gating
Grouped linear transformation
Group: split the input into smaller groups
Transform: learn group-specific representations
Merge: concatenate the group outputs to produce the final output
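The split/transform/merge idea behind the grouped linear transformation can be sketched in plain Python: each group gets its own smaller matrix, which cuts the parameters of a d-to-d linear map from d² to d²/g. Function names and toy data are ours, a sketch of the idea rather than the paper's implementation:

```python
def grouped_linear(x, weights):
    """Split x into len(weights) groups, transform each group with its own
    matrix, and concatenate the group outputs."""
    g = len(weights)
    size = len(x) // g
    out = []
    for gi, w in enumerate(weights):
        chunk = x[gi * size:(gi + 1) * size]        # group
        out.extend(sum(w[r][c] * chunk[c]           # transform
                       for c in range(size))
                   for r in range(len(w)))          # merge (concatenate)
    return out

def grouped_linear_params(d_in, d_out, groups):
    """Each group maps d_in/g -> d_out/g, so parameters shrink by a factor g."""
    assert d_in % groups == 0 and d_out % groups == 0
    return groups * (d_in // groups) * (d_out // groups)
```

For example, a 1400-to-1400 linear map has 1,960,000 weights; with 4 groups (the hidden-size-1400 configuration on the next slide) it needs only 490,000.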

50. PRU vs. LSTM
Penn Treebank dataset
2 pyramid levels; the number of groups increases with the hidden size
Hidden size 1000, 1 group
Hidden size 1400, 4 groups

51. Compared to the State of the Art
Regularizing and Optimizing LSTM Language Models, Merity et al., ICLR 2018

Model     | Parameters | Perplexity
Char CNN  | 19 M       | 78.9
SRU       | 24 M       | 73.2
QRNN      | 18 M       | 78.3
AWD-LSTM  | 24 M       | 57.3
PRU       | 19 M       | 62.42
AWD-PRU   | 19 M       | 56.56

52. References
Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Xu, B., Wang, N., Chen, T. and Li, M., 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
Mehta, S., Rastegari, M., Caspi, A., Shapiro, L. and Hajishirzi, H., 2018. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. arXiv preprint arXiv:1803.06815.
Mehta, S., Koncel-Kedziorski, R., Rastegari, M. and Hajishirzi, H., 2018. Pyramidal Recurrent Unit for Language Modeling. arXiv preprint arXiv:1808.09029.
LSTM blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Convolution arithmetic: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

53. Thanks!!