1. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
Ali Shafiee*, Anirban Nag*, Naveen Muralimanohar†, Rajeev Balasubramonian*, John Paul Strachan†, Miao Hu†, R. Stanley Williams†, Vivek Srikumar*
*University of Utah  †Hewlett Packard Labs
2. Executive Summary
- Classifying images is in vogue; conv nets are the best, and they are dominated by vector-matrix multiplication.
- The analog memristor crossbar is a great fit, but analog-to-digital conversion overheads are significant.
- Smart encoding reduces these overheads.
- A balanced pipeline is critical for high efficiency.
- Preserving high precision is essential in analog.
- ISAAC is 14.8x better in throughput and 5.5x better in energy than the digital state of the art (DaDianNao).
3. State-of-the-Art Convolutional Neural Networks
- Deep residual networks: convolution layers, pooling layers, fully connected layers.
- 152 layers, 11 billion operations!
4. Convolution Operation
[Figure: an Nx × Ny input with Ni = 3 channels is convolved with No = 3 kernels of size Kx = Ky = 2 at stride Sx = Sy = 1, producing No output feature maps.]
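The operation in the figure can be sketched directly in NumPy. This is a minimal, naive reference implementation in the slide's notation (not code from the paper); `conv_layer` is a hypothetical helper name:

```python
import numpy as np

def conv_layer(x, kernels, sx=1, sy=1):
    """Direct (naive) convolution in the slide's notation:
    x: input of shape (Ni, Ny, Nx); kernels: (No, Ni, Ky, Kx)."""
    no, ni, ky, kx = kernels.shape
    _, ny, nx = x.shape
    oy = (ny - ky) // sy + 1
    ox = (nx - kx) // sx + 1
    out = np.zeros((no, oy, ox))
    for o in range(no):                 # one output feature map per kernel
        for i in range(oy):
            for j in range(ox):
                window = x[:, i*sy:i*sy+ky, j*sx:j*sx+kx]
                out[o, i, j] = np.sum(window * kernels[o])
    return out
```

Each output pixel is a dot product between a flattened input window and a flattened kernel, which is exactly the shape of work the crossbar on the next slides performs in analog.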
5. Memristor Dot-Product Engine
Each cell contributes a current by Ohm's law (I1 = V1·G1, I2 = V2·G2), and the bitline sums the currents by Kirchhoff's current law: I = I1 + I2 = V1·G1 + V2·G2.
[Figure: a 4×4 crossbar with weights w00..w33 stored as conductances; input voltages x0..x3 drive the rows, and column currents y0..y3 are the dot products.]
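Functionally, the crossbar computes a vector-matrix product in one analog step. A minimal NumPy sketch of that behavior (the voltage and conductance values here are made up for illustration):

```python
import numpy as np

# Ohm's law per cell (I = V * G) plus Kirchhoff summation per column
# means the crossbar evaluates y = V @ G in a single analog operation.
V = np.array([0.2, 0.5, 0.1, 0.8])      # row voltages (inputs x0..x3)
G = np.array([[1.0, 0.5],
              [2.0, 1.0],
              [0.5, 0.0],
              [1.0, 2.0]])              # conductances: 4 rows x 2 columns
I_columns = V @ G                        # summed column currents (outputs)

# Same result computed cell by cell, as the analog array does it:
manual = [sum(V[i] * G[i, j] for i in range(4)) for j in range(2)]
assert np.allclose(I_columns, manual)
```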
6. Memristor Dot-Product Engine
[Figure: each kernel (Kx = Ky = 2, Ni = 3, stride Sx = Sy = 1) is flattened into one crossbar column; kernels 0-2 occupy three columns, so one crossbar operation evaluates all No = 3 kernels on the same input window.]
7. Crossbar
Each 16-bit input neuron is streamed through a 1-bit DAC over 16 iterations; each 16-bit weight is spread across eight 2-bit cells in adjacent columns. Shift-and-add circuits merge the per-iteration and per-cell partial results.
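The bit-serial scheme above can be checked with a short sketch (unsigned values only; signed handling comes on slide 15). `crossbar_dot` is a hypothetical name, not an ISAAC API:

```python
def crossbar_dot(xs, ws, xbits=16, wbits=16, cell=2):
    """Bit-serial dot product as on the slide: each 16-bit weight is
    split across eight 2-bit cells, each 16-bit input is streamed one
    bit per iteration, and shift-and-add merges both dimensions."""
    ncells = wbits // cell
    # cells[c][i] = 2-bit slice c of weight i (c = 0 least significant)
    cells = [[(w >> (c * cell)) & ((1 << cell) - 1) for w in ws]
             for c in range(ncells)]
    acc = 0
    for k in range(xbits):                    # 16 input iterations
        bits = [(x >> k) & 1 for x in xs]     # 1-bit DAC inputs this cycle
        for c in range(ncells):               # one column group per cell slice
            col = sum(b * g for b, g in zip(bits, cells[c]))  # analog column sum
            acc += col << (k + c * cell)      # shift-and-add
    return acc

assert crossbar_dot([3, 50000], [7, 1234]) == 3 * 7 + 50000 * 1234
```

The inner `col` is what one ADC conversion digitizes; everything outside it is the digital shift-and-add of the ISAAC pipeline.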
8. ISAAC Organization
[Figure: input register → DAC → crossbar (rows 0-127 and rows 128-255) → analog-to-digital conversion → shift-and-add over 16 iterations → partial outputs 0 and 1 merged → sigmoid → output register.]
9. An ISAAC Chip – Inter-Tile Pipelining
[Figure: layers 1-3 mapped to tiles 1-3, each with its own eDRAM buffer; the tiles form a pipeline.]
10. Balanced Pipeline
- If layer i has strides Sx = 1 and Sy = 2, replicate layer i−1 two times so adjacent pipeline stages produce and consume data at matched rates.
- Storage allocation proceeds from the last layer backwards; each buffer tracks data not computed yet, data received from the previous layer, and data already serviced and released.
11. Balanced Pipeline
[Figure: 128×128 crossbars allocated per layer; layers that feed strided layers (Sx = 2, Sy = 2 or Sx = 1, Sy = 2) are replicated in proportion to the stride product.]
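The replication rule can be sketched as follows, assuming (as the slides suggest) that each layer must be replicated by the product of the stride areas of all downstream layers. `replication_factors` is a hypothetical helper for illustration, not part of ISAAC:

```python
def replication_factors(strides):
    """strides[i] = (Sx, Sy) of layer i. A layer with stride area
    Sx*Sy consumes that many of its predecessor's outputs per output
    it produces, so replication compounds from back to front."""
    n = len(strides)
    reps = [1] * n                      # the last layer is not replicated
    for i in range(n - 2, -1, -1):
        sx, sy = strides[i + 1]
        reps[i] = reps[i + 1] * sx * sy
    return reps
```

For example, with layer strides [(1,1), (1,2), (2,2)] the factors come out [8, 4, 1]: the middle layer is replicated 4x for the (2,2) layer behind it, and the first layer 2x more for the (1,2) layer, matching the slide's "replicate layer i−1 two times" rule.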
12. The ADC Overhead
ADCs occupy large area and are power hungry; area and power increase exponentially with ADC resolution and frequency.
13. The ADC Overhead
With R crossbar rows, v-bit DAC inputs, and w-bit memristor cells, the required ADC resolution is log2(R) + v + w − 1 bits (when v = 1).
Example: R = 128, v = 1, w = 2 → 7 + 1 + 2 − 1 = 9-bit ADC.
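The resolution formula falls out of the worst-case column sum, which a couple of lines can confirm (`adc_resolution` is a hypothetical name):

```python
import math

def adc_resolution(R, v, w):
    """Bits needed to capture the worst-case column sum: R rows, each
    contributing a (v-bit input) x (w-bit cell) product."""
    max_sum = R * ((1 << v) - 1) * ((1 << w) - 1)   # e.g. 128 * 1 * 3 = 384
    return math.ceil(math.log2(max_sum + 1))

assert adc_resolution(128, 1, 2) == 9   # matches log2(R) + v + w - 1 on the slide
```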
14. Encoding Scheme
If a column's sum under maximal input would set the ADC's MSB, store that column's weights in flipped (complemented) form so the MSB is always 0; the original sum is recovered digitally. Effective ADC resolution required: 8 bits.
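A small sketch of the flip-and-recover idea, under the simplifying assumptions of 1-bit inputs and a conservative "flip if the worst case exceeds half the range" test (the function names are hypothetical):

```python
W = 2                                   # bits per cell
CMAX = (1 << W) - 1                     # max cell value = 3

def encode_column(cells, rows):
    """Store the column flipped iff its worst-case (all-ones input)
    sum lands in the upper half of the ADC range; either way the
    stored column's sums stay in the lower half, so 8 bits suffice."""
    worst = sum(cells)                  # sum when all R inputs are 1
    flip = worst > rows * CMAX // 2
    return ([CMAX - c for c in cells], True) if flip else (cells, False)

def read_column(stored, flip, x):       # x: 1-bit inputs this iteration
    raw = sum(b * c for b, c in zip(x, stored))   # what the ADC digitizes
    return CMAX * sum(x) - raw if flip else raw   # undo the flip digitally

stored, flip = encode_column([3, 3, 3, 1], rows=4)
assert flip and stored == [0, 0, 0, 2]
assert read_column(stored, flip, [1, 0, 1, 1]) == 3 + 3 + 1
```

The recovery works because a flipped column reads (CMAX − c) per active row, so the true sum is CMAX × (number of active rows) minus the raw reading; only the active-row count, already available digitally, is needed.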
15. Handling Signed Arithmetic
- Input neurons are 2's complement: the MSB represents −2^15, so the 16th iteration uses shift-and-subtract instead of shift-and-add.
- Weights are stored with a bias of 2^15 (like the biased exponent in floating-point); subtract as many biases as the number of 1s in the input to recover the signed result.
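Both tricks can be verified in a short sketch: biased weights keep the array unsigned, the MSB iteration subtracts, and the bias correction is rebuilt from the same input bit streams. `signed_dot` is a hypothetical illustration of the slide's scheme, not ISAAC code:

```python
def signed_dot(xs, ws, bits=16):
    """2's-complement dot product with biased weights: the crossbar
    holds w + 2^15 (unsigned); iteration 15 is shift-and-SUBTRACT;
    one bias is subtracted per input, weighted by the input's value."""
    bias = 1 << (bits - 1)
    stored = [w + bias for w in ws]          # unsigned values in the array
    xu = [x % (1 << bits) for x in xs]       # 2's-complement bit patterns
    acc = 0                                  # accumulates sum_i x_i * stored_i
    ones = 0                                 # accumulates sum_i x_i
    for k in range(bits):
        sign = -1 if k == bits - 1 else 1    # the MSB carries weight -2^15
        b = [(x >> k) & 1 for x in xu]       # 1-bit inputs this iteration
        acc  += sign * (sum(bi * s for bi, s in zip(b, stored)) << k)
        ones += sign * (sum(b) << k)
    return acc - bias * ones                 # remove one bias per unit of input

assert signed_dot([-3, 5], [-7, 2]) == (-3) * (-7) + 5 * 2
```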
16. Analysis Metrics
- CE: Computational Efficiency → GOPS/mm²
- PE: Power Efficiency → GOPS/W
- SE: Storage Efficiency → MB/mm²
17. Design Space Exploration
Parameters swept: 1) rows per crossbar, 2) ADCs per IMA (in-situ multiply-accumulate unit), 3) crossbars per IMA, 4) IMAs per tile.
18. Design Space Exploration
[Figure: GOPS/mm² and GOPS/W across the design points, marking the best configurations ISAAC-CE, ISAAC-PE, and ISAAC-SE.]
19. Power Contribution
[Figure: chip power breakdown; labeled slices include 49%, 58%, 16%, 12%, 7%, 5%, HyperTransport (3%), and Router.]
20. Improvement over DaDianNao (Throughput)
Throughput is 14.8x better because:
1. Memristor crossbars have high computational parallelism.
2. DaDianNao fetches both inputs and weights from eDRAM; ISAAC fetches just the inputs.
3. DaDianNao suffers from bandwidth limitations in fully connected layers.
ISAAC requires more power but is 5.5x better in energy for the same reasons.
[Figure: results across deep neural network benchmarks.]
21. Conclusion
ISAAC:
- Takes advantage of analog in-situ computing and fetches just the input neurons.
- Handles ADC overheads with smart encoding.
- Does not compromise on output precision.
- Is faster than DaDianNao thanks to 8x better computational efficiency and a balanced pipeline that keeps all units busy.
A few questions remain: can online training be integrated?