236 • 2021 IEEE International Solid-State Circuits Conference


15.1 A Programmable Neural-Network Inference Accelerator (Hongyang Jia, Murat Ozatay, Yinqi Tang, Hossein Valavi, Rakshit Pathak). This paper presents a scalable neural-network (NN) inference accelerator in 16nm.




Presentation Transcript

236 • 2021 IEEE International Solid-State Circuits Conference / DIGEST OF TECHNICAL PAPERS

15.1 A Programmable Neural-Network Inference Accelerator

Hongyang Jia, Murat Ozatay*, Yinqi Tang*, Hossein Valavi*, Rakshit Pathak*

This paper presents a scalable neural-network (NN) inference accelerator in 16nm, based on an array of programmable cores employing mixed-signal In-Memory Computing (IMC), digital Near-Memory Computing (NMC), and localized buffering/control. IMC achieves high energy efficiency and throughput for matrix-vector multiplications (MVMs), which dominate NNs; but scalability poses numerous challenges.

Figure captions (fragments recovered from extraction):
Figure 15.1.2: Details of CIMU core, showing internal datapath comprising buffering/control logic and the Compute-In-Memory Array (CIMA) for mixed-signal IMC. Spatial-mapping techniques enhance efficiency of spatial mappings, necessary for ensuring high hardware utilization.
Figure 15.1.4: Illustration of spatial-mapping techniques to support NN models.
Figure 15.1.5: Basic chip operation measurements, showing CIMA column-compute characterization.

[Figure labels: Compute-In-Memory Unit (CIMU) cores connected by an on-chip network (OCN), with interface circuitry to off-chip; CIMU internals with Input Buffer, Shortcut Buffer, BPBS SIMD instruction memory, CMPT SIMD instruction memory, control unit, and CIMA (1152 rows × 256 columns) with 8-b ADCs and Multiplying Bit Cells (M-BC); CIMA weight loader fed from a weight-loading network; scalable IMC architecture (4×4 CIMU array demoed) with segmented weight buffer, activation buffers, PLL, network in/out blocks, disjoint buffer switch, and duo-directional pipelined routing.]
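The BPBS (bit-parallel/bit-serial) scheme named in the labels above decomposes a multi-bit MVM into 1-b operations: each weight bit maps to its own column (bit-parallel), each activation bit is applied on a separate cycle (bit-serial), the per-column sum of 1-b products is what the column ADC digitizes, and digital shift-and-add recombines the partial results. A minimal functional sketch under those assumptions (plain Python, unsigned operands; the function name and structure are illustrative, not the chip's implementation):

```python
def bpbs_mvm(W, x, w_bits=4, a_bits=4):
    """Functional sketch of a bit-parallel/bit-serial (BPBS) MVM.

    W: matrix as a list of rows; each entry is an unsigned w_bits integer.
    x: input activations, unsigned a_bits integers (one per row).
    Returns y[c] = sum_r W[r][c] * x[r], computed the BPBS way.
    """
    n_rows, n_cols = len(W), len(W[0])
    y = [0] * n_cols
    for ab in range(a_bits):                      # bit-serial: one activation bit per cycle
        x_bit = [(xi >> ab) & 1 for xi in x]
        for wb in range(w_bits):                  # bit-parallel: one column per weight bit
            for c in range(n_cols):
                # column sum of 1-b products (the quantity a column ADC would digitize)
                col = sum(((W[r][c] >> wb) & 1) & x_bit[r] for r in range(n_rows))
                y[c] += col << (ab + wb)          # digital shift-and-add recombination
    return y

print(bpbs_mvm([[3, 1], [2, 4], [7, 0]], [5, 6, 2]))  # → [41, 29]
```

On the chip, the inner column sum is an analog accumulation digitized by an 8-b ADC, and the shift-and-add recombination would fall to digital near-memory logic such as the BPBS SIMD; the sketch only reproduces the arithmetic decomposition.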

Spatial-mapping techniques (Figure 15.1.4 labels): 2× output-depth extension (input activation routed to multiple CIMUs, with output activation routing); input-depth extension (digital adder for scalable depth extension); skip-CIMA operation (face-to-face connections for 2× depth extension).

Neural-network Demonstrations (chip measurements):
Dataset: CIFAR-10 | ImageNet
Neural-network style: VGG | ResNet-50
Bit precision (weight/act.): 4b/4b | 4b/4b
Accuracy of chip (vs sim.): 91.51% (vs 91.68%) | 73.08% (vs 73.17%)
Energy per classification: 19.4 µJ | 334 µJ
Throughput: 7815 image/sec | 581 image/sec

CIFAR-10 topology: L1: CONV 3×3 128; L2: CONV 3×3 128; L3: CONV 3×3 128; L4: CONV 3×3 128, POOL; L5: CONV 3×3 256; L6: CONV 3×3 256, POOL; L7: CONV 3×3 256; L8: CONV 3×3 256, POOL; L9: DENSE 1024; L10: DENSE 1024; L11: DENSE 10.
ImageNet topology (ResNet-50, four bottleneck stacks): CONV 1×1 64 / CONV 3×3 64 / CONV 1×1 256; stride 2; CONV 1×1 128 / CONV 3×3 128 / CONV 1×1 512; stride 2; CONV 1×1 256 / CONV 3×3 256 / CONV 1×1 1024; stride 2; CONV 1×1 512 / CONV 3×3 512 / CONV 1×1 2048; DENSE 1000.

Notes: Shortcut projection in bottleneck layers employs higher precision, as typically required (10% of total MACs). Energy and throughput are for the array-based accelerator core. Last stack mapped to chip at time of submission, due to time constraints (other stacks being mapped).

[Measurement plot: compute value vs. data index, bit-true sim. vs. chip meas.]

Comparison Table (cell values garbled by extraction): columns Technology, Area (mm²), Voltage (V), On-chip Memory, Bit Precision, Peak MAC Throughput (GOPS), Peak MAC Energy Eff. (TOPS/W), MAC Comp. Density (TOPS/mm²), Application; rows include This work (16nm; CNN, RNN, MLP, etc.), Jia (JSSC, 65nm), Dong (ISSCC), Yue (ISSCC, 65nm; CNN, MLP), Wang (ISSCC, 28nm; matrix-vector and vector ops).
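Input-depth extension lets a layer whose input dimension exceeds a single CIMA's 1152 rows be split across CIMUs, with digital adders combining the partial pre-activations. A small functional sketch of that partitioning (plain Python; the function names are illustrative and the row count is just a parameter):

```python
def mvm(W, x):
    """Reference MVM for one CIMA: y[c] = sum_r W[r][c] * x[r]."""
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(len(W[0]))]

def input_depth_extended_mvm(W, x, rows_per_cima=1152):
    """Input-depth extension sketch: partition the input (row) dimension
    across CIMAs and combine partial pre-activations with digital adders."""
    y = [0] * len(W[0])
    for start in range(0, len(x), rows_per_cima):
        partial = mvm(W[start:start + rows_per_cima], x[start:start + rows_per_cima])
        y = [acc + p for acc, p in zip(y, partial)]  # digital adder across CIMAs
    return y
```

Output-depth extension is the complementary move: the same input activations are routed to multiple CIMUs, whose column outputs are concatenated to widen the output dimension rather than summed.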
Comparison Table, continued (garbled): further rows include a VLSI 65nm design, an ISSCC 65nm design, and Bankman (ISSCC, 28nm; 0.6 | 0.8 V, 2.6 Mb); an in-memory vs. not-in-memory computing attribute; an "optimized for execution scalability" attribute. Footnotes: only memory conducting in-memory computing considered; normalized to 4-bit compute; scales with number of bits used for input activation and weight.

Comparison Table, continued (garbled): execution-scalability row (YES/NO entries per design); Jiao (ISSCC, 12nm) with raw extracted values 709, 0.85, 1536 Mb, 8 b | 16 b, 11.8, 4.65 (column assignment uncertain); footnote: number obtained by exploiting weight/input sparsity.

[Measurement plots: compute value vs. data index, bit-true sim. vs. chip meas., for the CIFAR-10 CNN and ImageNet ResNet-50 with 4-b weights/activations. ImageNet: 73.08% (vs 73.17% sim.), 334 µJ per classification, 581 image/sec.]

Chip Summary:
Technology (nm): 16
FCLK (MHz): 200
Voltage (V): 0.8
Total Area (mm²): 25
Energy breakdown (measured using a 3×3×128 CNN kernel with 4b/4b weight/activation):
CIMA: 19.08 pJ/output act.
CIMA Write: 0.23 pJ/bit
BPBS SIMD: 3.89 pJ/output act.
CMPT SIMD: 0.60 pJ/output act.
IA Buff.: 8.60 pJ/output act.
OCN: 0.27 pJ/bit/seg.

[ADC characterization: ADC output code vs. ideal pre-ADC value; INL (LSB) for a typical ADC; noise (LSB) measured across 256 columns of a CIMU (error bars show std. deviation); average noise across range: 0.68 LSB RMS. RMS SQNR (dB) vs. weight bit precision (2, 4, 6, 8) at activation bit precisions 2, 4, and 8; bit-true sim. vs. chip meas.]

[Dataflow illustrations: mappings of a dense layer, a convolutional layer (e.g., stride step 1), an LSTM layer, and a shortcut connection through the Input Buffer, Shortcut Buffer, and CIMA; cross-column ops; cross-input SIMD; CMPT SIMD (configurable accessing for pooling); output pre-activation (e.g., 4-b weights); activations from the OCN.]

2021 IEEE International Solid-State Circuits Conference. 978-1-7281-9549-0/21/$31.00 ©2021 IEEE

Figure 15.1.7: Die photo of prototype implemented in 16nm. [Die photo labels: 4×4 array of CIMUs, OCN, activation buffers, PLL, control, test structure; 5 mm.]

References (fragments):
[7] R. Guo et al., "A 5.1pJ/Neuron 127.3µs/Inference RNN-based Speech Recognition ...," VLSI.
"... 35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling ..." (reference number lost in extraction)
[9] Q. Dong et al., "A 351TOPS/W and 372.4GOPS Compute-in-Memory SRAM Macro ...," ISSCC.
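The SQNR characterization above can be approximated in simulation by quantizing ideal pre-ADC column values with an ideal uniform 8-b ADC model and comparing signal power to quantization-noise power. A minimal sketch (plain Python; the clipped mid-rise ADC model, full-scale handling, and function name are illustrative assumptions, not the chip's measured transfer function):

```python
import math
import random

def sqnr_db(values, adc_bits=8, full_scale=None):
    """RMS signal-to-quantization-noise ratio (dB) for an ideal uniform ADC.

    Quantizes each value with a clipped mid-rise characteristic spanning
    [-full_scale, +full_scale], then compares signal to error power.
    """
    fs = full_scale if full_scale is not None else max(abs(v) for v in values)
    levels = 2 ** adc_bits
    lsb = 2.0 * fs / levels

    def quantize(v):
        code = min(levels - 1, max(0, int((v + fs) / lsb)))  # clip to valid codes
        return (code + 0.5) * lsb - fs                       # reconstruction level

    signal_power = sum(v * v for v in values)
    noise_power = sum((v - quantize(v)) ** 2 for v in values)
    return 10.0 * math.log10(signal_power / noise_power)

random.seed(0)
pre_adc = [random.uniform(-1.0, 1.0) for _ in range(20000)]
print(round(sqnr_db(pre_adc, adc_bits=8, full_scale=1.0), 1))  # ≈ 48 dB for a full-scale uniform input
```

For a full-scale uniform input this lands near the textbook 6.02 dB/bit figure; reduced weight/activation precision or added analog noise (e.g., the 0.68 LSB RMS reported above) lowers the achievable SQNR, which is what the measured curves sweep.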