Lecture: Simba
Topics: multi-chip-module design for scalable inference



Presentation Transcript

1. Lecture: Simba. Topics: multi-chip-module design for scalable inference

2. Multi-Chip-Module (MCM)
A single package that includes multiple dies (chips or chiplets), 36 chiplets in this case; this reduces design cost.
The package substrate has inter-chip links that are better than off-package links, but not as fast as on-chip links.

3. Goals
Create a single low-cost chiplet that can be used in edge devices and also in datacenters in MCM form.
Given the cost of inter-chiplet communication, what is the most effective way to split computation across chiplets?
Perform low-latency inference with no batching (?)
Note that DaDianNao had a similar approach, but it used board-level integration and didn't scrutinize computation mapping.

4. Simba Package and Chiplet
The Global PE and RISC-V controller orchestrate inputs to the next layer. Outputs of a layer are also sent to a Global PE and monitored by the controller. The Global PE also handles computations that have little reuse/parallelism.

5. Simba PE
The PE is a scaled-down version of the NVDLA: an 8x8 MAC array, weight-stationary. New inputs keep coming every cycle. The final activation is sent to a Global PE; the Global PE and RISC-V controller orchestrate inputs to the next layer.
Per-PE buffers (from the figure): 32KB, 8KB, 3KB. 8b weights/acts; 24b accumulation.
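To make the dataflow concrete, here is a minimal Python sketch of a weight-stationary 8x8 MAC array with 8b operands and a saturating 24b accumulator, as described above. The function name, the use of NumPy, and the loop structure are illustrative assumptions, not the PE's actual microarchitecture.

import numpy as np

ROWS, COLS = 8, 8  # 8x8 MAC array; weights stay in place (weight-stationary)

def pe_matmul(weights, inputs):
    """weights: (ROWS, COLS) int8, loaded once and held stationary.
    inputs: (T, ROWS) int8, one input vector streamed in per cycle.
    Returns (T, COLS) outputs accumulated at 24-bit signed width."""
    acc_min, acc_max = -(1 << 23), (1 << 23) - 1   # 24b signed accumulator range
    out = np.zeros((inputs.shape[0], COLS), dtype=np.int64)
    for t, x in enumerate(inputs):                 # a new input arrives every cycle
        for c in range(COLS):                      # each column produces one output
            s = int(np.dot(x.astype(np.int64), weights[:, c].astype(np.int64)))
            out[t, c] = max(acc_min, min(acc_max, s))  # saturate to 24 bits
    return out

In the real PE, the final activations would then be shipped to the Global PE, as noted above.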

6. Parameters
(Table of inter-package network and intra-package network parameters.)
Total of 64KB per chiplet x 36 chiplets = 2.25MB.

7. System Metrics
Operates at variable frequency.
At low voltage (0.42V), as little as 0.11 pJ/Op.
At high voltage (1.2V), as much as 2 GHz and 4 TOPS.
At 1.1V, the 36-chiplet system achieves 128 TOPS, 6.1 TOPS/W.
Most of the eval is at 0.72V, 1 GHz, 11 Gbps links.
The latency gap is because of delays to get to the edge, clock synchronization, serdes encoding, and package link delays.
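A quick arithmetic check on these headline numbers; nothing here is assumed beyond the figures quoted on this slide.

tops = 128.0           # 36-chiplet throughput at 1.1V
tops_per_watt = 6.1    # efficiency at the same operating point
print(f"implied package power: {tops / tops_per_watt:.1f} W")   # ~21.0 W

# 1 op/pJ = 1e12 ops/J = 1 TOPS/W, so the 0.42V point converts as:
pj_per_op = 0.11
print(f"low-voltage efficiency: {1 / pj_per_op:.1f} TOPS/W")    # ~9.1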

8. Computation Mapping Strategies
Latency normalized to ideal (100% utilization on all chiplets).
Shows a different mapping for each layer of ResNet (mapping matters!).
More chiplets reduce latency in most cases.
More chiplets increase energy-per-op because of communication (next slide).
Latency and energy efficiency vary by layer.
Would have been nice to show the best static mapping.
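A minimal sketch of the normalization used on this slide, assuming "ideal" means every MAC unit on every chiplet busy every cycle. The layer shape and the measured latency below are hypothetical values for illustration; 16 PEs per chiplet with 8x8 MACs each gives 1024 MACs per chiplet per cycle.

def ideal_latency_s(total_macs, n_chiplets, macs_per_chiplet_per_cycle, freq_hz):
    # 100% utilization on all chiplets: all MAC units busy every cycle
    cycles = total_macs / (n_chiplets * macs_per_chiplet_per_cycle)
    return cycles / freq_hz

# Example: a 3x3, 256->256 convolution on a 14x14 feature map
total_macs = 14 * 14 * 256 * 256 * 3 * 3
ideal = ideal_latency_s(total_macs, n_chiplets=36,
                        macs_per_chiplet_per_cycle=16 * 64, freq_hz=1e9)
measured = 3.1e-5  # hypothetical measured latency for this layer, in seconds
print(f"normalized latency: {measured / ideal:.1f}x ideal")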

9. Energy
(Figure: energy-per-op results referenced from the previous slide.)

10. Computation Spreading
Latency for one of the layers.
Too much spreading across PEs can worsen latency.
Spreading across chiplets is worse than spreading within a chiplet.
The type of mapping and the extent of mapping both matter.
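A toy model of the tradeoff on this slide; every constant below is an assumption chosen only to reproduce the shape of the curve, not a measured value.

def layer_latency(n_pes, work=1e6, pes_per_chiplet=16,
                  sync_per_pe=50.0, inter_chiplet_cost=500.0):
    compute = work / n_pes                          # shrinks as we spread
    n_chiplets = -(-n_pes // pes_per_chiplet)       # ceiling division
    # synchronization grows with the number of participants, and crossing
    # chiplet boundaries is assumed far costlier than staying on-chip
    comm = n_pes * sync_per_pe + (n_chiplets - 1) * inter_chiplet_cost
    return compute + comm

for n in (1, 4, 16, 64, 256):
    print(n, layer_latency(n))   # latency bottoms out, then rises again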

11. Computation Spreading
Three representative layers.
Marginal returns with many chiplets.
Reinforces the observation on the previous slide; also notes variation across layers.

12. NoP Bandwidth Impact
Changing the network speed does improve latency, but only 5% in one layer and 27% in another.
A similar intra-chiplet experiment would have been informative too.
Key question: why not overlap computation and communication?
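On the key question, here is a minimal sketch of comm/compute overlap via double buffering; the tile counts and timings are hypothetical, and the paper does not describe this mechanism.

def total_time(n_tiles, t_compute, t_comm, overlap):
    if not overlap:
        return n_tiles * (t_compute + t_comm)       # fetch, then compute, serially
    # double-buffered: fetch tile i+1 over the NoP while computing tile i,
    # so steady state costs only the slower of the two per tile
    return t_comm + n_tiles * max(t_compute, t_comm)

print(total_time(100, t_compute=1.0, t_comm=0.8, overlap=False))  # 180.0
print(total_time(100, t_compute=1.0, t_comm=0.8, overlap=True))   # 100.8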

13. NoP Latency Impact
Much bigger impact on latency for one layer.

14. Impact of Batch Size
For a small network (a large network can't handle much batching).
Favorable impact on throughput, minor negative effect on latency.
Latency increases because of higher synchronization costs.
Note that the previous figures were examples of strong scaling (fixed problem size), while this is an example of weak scaling.
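A toy model of why batching raises throughput while nudging latency upward; the constants are assumptions, with fixed per-layer costs (e.g., loading weights) amortized over the batch and synchronization growing with it.

def batch_latency(b, fixed=0.5, compute_per_image=0.028, sync_per_image=0.01):
    return fixed + b * (compute_per_image + sync_per_image)

for b in (1, 4, 16):
    lat = batch_latency(b)
    print(f"batch={b:2d}  latency={lat:.3f}  throughput={b / lat:.1f}")
# latency grows with the batch, but throughput grows faster (weak scaling)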

15. Comparison to GPUs
Simba is a lot worse than GPUs at high batching factors.
Latency numbers are needed for a fair comparison; note that Simba is designed for low latency (0.5 ms) with a minimal batching factor.
Note that batching was an opportunity for the TPU, which used a large (24MB) activation buffer to achieve high throughput and low energy.

16. New Idea 1
Depending on the position of the Global PE providing activations, the latency for each chiplet varies; this impacts tail latency for task completion.
Can therefore assign non-uniform work to each chiplet for uniform latency.
Small improvement; same question about comm/compute overlap.
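A minimal sketch of the load-balancing idea, assuming each chiplet's finish time is its communication latency from the source Global PE plus its share of compute; the hop counts and rates below are made-up illustrative values.

def balance_work(total_work, comm_latency, compute_rate=1.0):
    # Finish time of chiplet i: comm_latency[i] + work[i] / compute_rate.
    # Solve for work[i] so all finish times equal T and the work sums up:
    # sum_i (T - comm_latency[i]) * compute_rate = total_work
    n = len(comm_latency)
    T = (total_work / compute_rate + sum(comm_latency)) / n
    return [(T - c) * compute_rate for c in comm_latency]

hops = [1, 2, 2, 3]                            # NoP hops from the source Global PE
work = balance_work(1000.0, [10.0 * h for h in hops])
print([round(w, 1) for w in work])             # nearer chiplets receive more work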

17. New Idea 2
Each layer demands a different communication pattern; some broadcast to all, while others have frequent communication within a group.
A greedy algorithm identifies where inputs and outputs get mapped to reduce communication.
Small benefit; maybe more interesting from an energy perspective?
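A greedy placement sketch in the spirit of this slide; the cost model, the capacity limit, and the visiting order are assumptions, and the paper's actual algorithm may differ.

def greedy_place(tiles, n_chiplets, comm_cost, capacity=1):
    """tiles: list of (tile_id, [ids of earlier tiles it communicates with]).
    Each tile goes to the chiplet adding the least traffic to placed neighbors."""
    placement, load = {}, [0] * n_chiplets
    for tile, neighbors in tiles:
        placed = [placement[n] for n in neighbors if n in placement]
        candidates = [c for c in range(n_chiplets) if load[c] < capacity]
        best = min(candidates, key=lambda c: sum(comm_cost(c, p) for p in placed))
        placement[tile], load[best] = best, load[best] + 1
    return placement

def mesh_cost(c1, c2, side=6):                  # 6x6 mesh, Manhattan distance
    return abs(c1 // side - c2 // side) + abs(c1 % side - c2 % side)

tiles = [("a", []), ("b", ["a"]), ("c", ["a", "b"]), ("d", ["c"])]
print(greedy_place(tiles, 36, mesh_cost))       # communicating tiles cluster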

18. New Idea 3
For small layers, multiple layers can be performed simultaneously in a pipelined fashion.
Improves throughput.
No discussion of latency.
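A toy throughput model for this idea, assuming (per the marginal returns on slide 11) that a small layer does not get much faster with more chiplets, so dedicating a chiplet group to each layer and streaming inputs through acts like a pipeline; the stage times are hypothetical.

def pipelined_throughput(stage_times):
    return 1.0 / max(stage_times)     # steady state: one result per slowest stage

def sequential_throughput(stage_times):
    return 1.0 / sum(stage_times)     # all layers run back to back per input

stages = [0.20, 0.30, 0.25]           # ms per small layer (hypothetical)
print(pipelined_throughput(stages))   # ~3.3 results/ms
print(sequential_throughput(stages))  # ~1.3 results/ms
# per-input latency is still the sum of stage times (plus handoffs), which
# is presumably why the slide flags the missing latency discussion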

19. References
"Simba: Scaling Deep Learning Inference with Multi-Chip-Module based Architecture", Y.S. Shao et al., MICRO 2019.