Lecture: Eyeriss Dataflow



Presentation Transcript

1. Lecture: Eyeriss Dataflow

Topics: Eyeriss architecture and dataflow (digital CNN accelerator).

We had previously seen basic ANNs that used tiling/buffers/NFUs (DianNao), multiple chips to distribute the work and eliminate memory accesses (DaDianNao), and a few examples that exploited sparsity (Deep Compression, EIE, Cnvlutin). Now we'll examine a spatial architecture that exploits various dataflows to reduce energy overheads. Spatial architectures get inputs from their neighbors, which helps computations that exhibit producer-consumer relationships and/or data reuse. The next slide shows the overall Eyeriss architecture. The chip has a global buffer that holds a few hundred kilobytes of data. It also has an array of processing elements (PEs) connected by a mesh network. Each PE has a single MAC unit and a register file (which can store a few hundred values). There may also be FIFO input/output buffers within and between PEs, but we won't discuss those today since they aren't essential.

2. Dataflow Optimizations

Data can be fetched from four different locations, each with a different energy cost: off-chip DRAM, the global buffer, a neighboring PE's regfile, and the local regfile. The key is to find a dataflow that reduces the overall energy cost for a large CNN computation.

3. Overall Spatial Architecture

4. One Primitive

Let's first take a look at a single PE in Eyeriss, and let's focus on a single 1D convolution (the computation required for 1 row of a 2D convolution). This is defined as one primitive, and one PE is responsible for one primitive. Before the computation starts, the PE loads its register file with 1 row of kernel weights (size R) and 1 row of an input feature map (size H). In the example above, a 3-entry kernel is applied to a 5-entry row; 9 MAC operations are performed sequentially to produce 3 output pixels. The resulting outputs are only partial sums (the pink boxes labeled 1, 2, 3). This computation has a lot of reuse (kernel and ifmap), and all of that reuse is exploited through the regfile.
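As a concrete illustration of this primitive, here is a minimal sketch (the function name pe_primitive_1d is made up for illustration) of one PE sliding an R-tap kernel row along an H-entry ifmap row held in its regfile:

```python
# Minimal sketch of the 1-D convolution primitive run inside one PE.
def pe_primitive_1d(ifmap_row, filter_row):
    """1 ifmap row and 1 filter row sit in the regfile; MACs run sequentially."""
    H, R = len(ifmap_row), len(filter_row)
    psums = [0] * (H - R + 1)            # partial sums, not final outputs
    for out in range(H - R + 1):         # slide the R-tap kernel along the row
        for k in range(R):
            psums[out] += ifmap_row[out + k] * filter_row[k]  # one MAC
    return psums

# The slide's example: a 3-entry kernel over a 5-entry row -> 3 psums, 9 MACs total.
print(pe_primitive_1d([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
```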

5. One Primitive

Consider a larger example: a row of the input feature map with 64 elements and one row of the kernel with 3 elements. Once these 67 elements are in the regfile, we perform computations that produce 62 partial-sum outputs. This takes 186 MAC operations. The register file may or may not have to store these partial sums (depending on how bypassing is done). This is called row-stationary: a row sits in a PE while all the required computations are performed. The computations are done sequentially on a single MAC unit.
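A quick sketch to sanity-check the counts quoted here (H = 64, R = 3):

```python
# Counts for one row-stationary primitive with H = 64, R = 3.
H, R = 64, 3
outputs = H - R + 1          # 62 partial sums per primitive
macs = outputs * R           # 186 sequential MAC operations
regfile_values = H + R       # 67 values resident before the computation starts
print(outputs, macs, regfile_values)   # 62 186 67
```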

6. Dataflow for a 2D Convolution

Now let's carefully go through the 2D convolution. We'll discuss "where" first and "when" later. See the figures on the next slide to follow along.

If the 2D convolution involves a 3x3 kernel, we partition it into 3 "primitives", with each primitive responsible for 1 row of the input feature map and 1 row of the kernel (remember that we use the terms kernel/filter/weights interchangeably). Take the first column of PEs. PE1,1 receives filter row 1 and input feature map row 1, and produces partial sums for the first row of the output feature map. Similarly, PE2,1 and PE3,1 deal with rows 2 and 3 of the kernel/ifmap. The partial sums of all three PEs must be added to produce the first row of the ofmap (see the red dataflow).

Next, to produce the second row of the ofmap, we move to the second column of PEs. Producing the second row requires the following convolutions: ifmap-row2 with filter-row1, ifmap-row3 with filter-row2, and ifmap-row4 with filter-row3. To make sure the right rows collide at the second column of PEs, the ifmap and filter rows follow the blue and green paths shown in the figure. This process continues.

To recap: inputs arrive at the first column. Each PE does a row's worth of work (1 primitive). The partial sums in the column are added to produce the first row of the ofmap. Then the ifmap and filter rows shift diagonally and laterally to the second column, and the partial sums of the second column are aggregated to produce the second row of the ofmap. If we are working with a 64x64 ifmap and a 3x3 filter, we need a 62x3 array of PEs.
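A hedged functional sketch of this mapping, assuming logical PE(i, j) holds filter row i and ifmap row i + j, and the partial sums of column j add up to ofmap row j (function names are illustrative, not from Eyeriss code):

```python
# Row-stationary mapping of a 2-D convolution onto a logical R x E PE grid.
def conv1d(ifmap_row, filter_row):
    """One PE's primitive: a 1-D convolution of one ifmap row with one filter row."""
    H, R = len(ifmap_row), len(filter_row)
    return [sum(ifmap_row[o + k] * filter_row[k] for k in range(R))
            for o in range(H - R + 1)]

def row_stationary_2d(ifmap, filt):
    H, R = len(ifmap), len(filt)                 # e.g., 64x64 ifmap, 3x3 filter
    E = H - R + 1                                # 62 ofmap rows (and columns)
    ofmap = []
    for j in range(E):                           # one column of PEs per ofmap row
        # PE(i, j) convolves ifmap row i + j with filter row i.
        col_psums = [conv1d(ifmap[i + j], filt[i]) for i in range(R)]
        # Partial sums are added down the column to form ofmap row j.
        ofmap.append([sum(p[x] for p in col_psums) for x in range(E)])
    return ofmap
```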

7. Dataflow for a 2D Convolution

The next slide also shows the required computations for our 2D convolution example. We'll refer to the first column and the last row of the PE array as "edge PEs". To perform their computation, these PEs must first receive 64 ifmap elements and 3 filter elements; the 64 ifmap elements must be fetched from the global buffer. We then perform 186 MAC operations, each requiring regfile reads and regfile writes (for partial sums). When the final result is produced, it is sent to the neighboring PE.

In non-edge PEs, the computations are similar. The key difference is that the 64 ifmap elements are read from a neighboring PE, not from the global buffer. We see here that most reads are from the local regfile or from a neighboring regfile; only a few accesses go to the global buffer. We've thus created a dataflow that again exploits locality and reuse, in addition to the reuse we've already exploited within 1 primitive.

Note that eventually we want to do a 4D computation (many ifmaps, many ofmaps, multiple images). We did this by first doing a 1D computation efficiently in one PE, then setting up a grid of PEs on a 2D chip to process a 2D convolution efficiently. Unfortunately, we can't have a 3D chip, so we have to stop with this 2D convolution primitive. To perform the 4D computation, we make repeated calls to the 2D convolution primitive with the following nested loops:

for (images 1..20)
  for (ofmaps 1..8)
    for (ifmaps 1..4)
      perform 2D convolution
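A minimal sketch of this loop nest (loop bounds taken from the example on slide 9; conv2d here is a functional stand-in for one pass over the PE array, not the hardware's implementation):

```python
import numpy as np

# Dimensions from the running example: 20-image batch, 8 ofmaps, 4 ifmaps,
# 64x64 ifmaps, 3x3 filters.
N, M, C, H, R = 20, 8, 4, 64, 3
E = H - R + 1                              # 62: ofmap height/width
ifmaps  = np.random.rand(N, C, H, H)
filters = np.random.rand(M, C, R, R)
ofmaps  = np.zeros((N, M, E, E))

def conv2d(x, w):
    """Stand-in for one 2-D convolution pass over the PE array."""
    out = np.zeros((E, E))
    for ki in range(R):
        for kj in range(R):
            out += w[ki, kj] * x[ki:ki + E, kj:kj + E]
    return out

for n in range(N):                         # images 1..20
    for m in range(M):                     # ofmaps 1..8
        for c in range(C):                 # ifmaps 1..4
            # Each 2-D conv produces partial sums that accumulate into ofmap[n, m].
            ofmaps[n, m] += conv2d(ifmaps[n, c], filters[m, c])
```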

8. Dataflow for a 2D Convolution

Given these nested loops, we can compute the total number of DRAM / global-buffer / PE / regfile accesses and hence compute energy. Note that the outputs of each 2D convolution are now partial sums and have to be aggregated as we walk through these nested loops. The compiler has to try different kinds of nested loops to figure out the best way to reduce energy.

The entire 4D computation thus has to be carefully decomposed into multiple 2D convolutions (and each 2D convolution is itself decomposed into multiple 1D convolutions, one per PE). Further, the 62x3 PE grid that we just constructed has to be mapped onto the actual PE grid that exists on the chip (e.g., the fabricated Eyeriss chip has a 12x14 grid).

Having addressed the "where", let's now talk about "when". To simplify the discussion, I said that one column of PEs is fully processed before we move to the next column. In reality, once an element of the ifmap arrives at a PE, it can also be passed to the next PE column in the next cycle, so the second column can start one cycle after the first column. We are thus pipelining the computations across all the columns. Once we reach steady state, all the columns are busy, achieving very high parallelism.
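To make the access-counting idea concrete, here is a small sketch that weights per-level access counts by assumed per-access energy costs; the cost ratios and the example counts are illustrative placeholders, not numbers from the Eyeriss paper:

```python
# Relative per-access costs are assumptions for illustration only.
COST = {"dram": 200.0, "glb": 6.0, "neighbor_rf": 2.0, "local_rf": 1.0}

def energy(access_counts):
    """access_counts: dict mapping storage level -> number of accesses."""
    return sum(COST[level] * n for level, n in access_counts.items())

# Hypothetical access counts produced by one candidate loop ordering:
print(energy({"dram": 1e5, "glb": 1e6, "neighbor_rf": 5e6, "local_rf": 5e7}))
```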

9. Row-Stationary Dataflow for one 2D Convolution

- Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; 20-image batch
- Edge primitive: (glb) 64 inputs, 3 weights; (reg) 186 MACs → psums (124 regfile/bypass); 62 psums → neighboring PE
- Other primitives: (PE) 64 inputs, 3 weights; (reg) 186 MACs → psums (124 regfile/bypass); 62 psums → neighboring PE
- The first step is done ~64 times; the second step is done ~122 times
- Eventually: ~4K outputs to the global buffer or DRAM
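A quick sketch checking these counts for the 64x64 ifmap / 3x3 filter example:

```python
# Logical PE array for one 2-D convolution: 3 filter rows x 62 ofmap rows.
rows, cols = 3, 62
total_pes = rows * cols                  # 186 primitives per 2-D convolution
edge_pes  = cols + rows - 1              # first column + last row, corner counted once = 64
other_pes = total_pes - edge_pes         # 122
outputs   = 62 * 62                      # 3844, i.e., ~4K ofmap values
print(total_pes, edge_pes, other_pes, outputs)   # 186 64 122 3844
```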

10. Folding

- May have to fold a 2D convolution over a small physical set of PEs (see the sketch below)
- Must eventually take the 4D convolution and fold it into multiple 2D convolutions: the 2D convolution has to be done C (input feature maps) x M (output feature maps) x N (image batch) times
- Can exploit global-buffer reuse and register reuse depending on the order in which you do this (note that you have to deal with inputs, weights, and partial sums)
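A hedged sketch of folding the logical 3x62 grid onto a 14-column-wide physical array in column strips; the chunking policy shown is only illustrative, not Eyeriss's exact mapping strategy:

```python
# Logical grid for one 2-D conv here: 3 rows x 62 columns; physical chip: 12 x 14.
LOGICAL_COLS, PHYS_COLS = 62, 14

passes = []
for start in range(0, LOGICAL_COLS, PHYS_COLS):      # fold along the column dimension
    cols = list(range(start, min(start + PHYS_COLS, LOGICAL_COLS)))
    passes.append(cols)                              # each pass covers <= 14 ofmap rows

print(len(passes), [len(p) for p in passes])         # 5 [14, 14, 14, 14, 6]
```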

11. Weight Stationary

- Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; 20-image batch
- Weight stationary: the weights stay in place
- Assume that we lay out the 3x3 weights across 9 PEs
- Let the inputs stream over these; each weight has to be seen 62x62 times, and there's no easy way to move the pixels around to promote this

There are other reasonable dataflows as well. In weight stationary, the 9 weights in a filter may occupy a 3x3 grid of PEs. The ifmap has to flow through these PEs. Note that each element in the ifmap has to combine with each of the weights, so there's plenty of required data movement.
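A hedged functional sketch of weight stationary, where each (ki, kj) position plays the role of one PE holding a single fixed weight while every ifmap pixel streams past it (so each weight is touched 62x62 times in this example); this is an illustration, not Eyeriss's implementation:

```python
# Weight-stationary view of a 2-D convolution.
def weight_stationary_2d(ifmap, filt):
    H, R = len(ifmap), len(filt)
    E = H - R + 1
    ofmap = [[0.0] * E for _ in range(E)]
    for ki in range(R):
        for kj in range(R):
            w = filt[ki][kj]                      # this weight stays in its PE
            for i in range(E):                    # ifmap pixels stream through
                for j in range(E):
                    ofmap[i][j] += ifmap[i + ki][j + kj] * w
    return ofmap
```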

12. Output Stationary

- Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; 20-image batch
- Output stationary: the output neuron stays in place
- Need to use all PEs to compute a subset of the 4D space of neurons at a time; can move inputs around to promote reuse

In output stationary, each PE is responsible for producing one output value, so all the necessary ifmap and filter values have to find their way to all the necessary PEs.
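A hedged functional sketch of output stationary, where each output position plays the role of one PE that accumulates its single output locally while the needed ifmap and filter values are delivered to it; again an illustration, not Eyeriss's implementation:

```python
# Output-stationary view of a 2-D convolution.
def output_stationary_2d(ifmap, filt):
    H, R = len(ifmap), len(filt)
    E = H - R + 1
    ofmap = [[0.0] * E for _ in range(E)]
    for i in range(E):
        for j in range(E):                        # one PE per output pixel
            acc = 0.0                             # the output stays in place
            for ki in range(R):
                for kj in range(R):               # inputs and weights are brought to the PE
                    acc += ifmap[i + ki][j + kj] * filt[ki][kj]
            ofmap[i][j] = acc
    return ofmap
```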

13. Terminology

14. Energy Estimates

- Most ops are in the convolutional layers
- Most energy is in the ALUs and RFs in conv layers; most energy is in the buffer in FC layers
- More storage = more delay

The first figure shows that the CONV layers account for 90% of all energy. Most of that energy is in the RFs and ALUs. The FC layers have lower reuse and therefore see more energy in the global buffer. The optimal design has a good balance of storage and PEs.

15. Summary

- It's all about reducing energy and data movement; assume the PEs are busy most of the time (except for edge effects)
- Reduced data movement → low energy and low area (from fewer interconnects)
- While row stationary is best, a detailed design-space exploration is needed to identify the best traversal through the 4D array
- It's not always about reducing DRAM accesses; even global-buffer accesses must be reduced
- More PEs allow for better data reuse, so they're not terrible even if it means a smaller global buffer
- Convs are 90% of all ops and growing
- Their best designs use 256 PEs, a 0.5KB regfile per PE, and a 128KB global buffer; filter/psum/activation regfile entries = 224/24/12

16. WAX

Recent work argues that the storage hierarchy in Eyeriss is sub-optimal: accessing a 0.5KB register file is expensive, as is accessing a 108KB global buffer. A more efficient hierarchy (as shown here) has a handful of registers per PE and a local buffer of size 8KB. This forms one tile; if more data is required, it is fetched from nearby tiles. This is also an example of near-data processing, where an array of 32 MACs is placed next to an 8KB buffer (subarray). A row of 32 values read from the buffer gets placed in either a W register (weights) or an A register (activations). These registers feed the 32 MACs in parallel in one cycle. In the next cycle, the A register performs a shift operation. This enables high reuse while performing a convolution operation. The WAX paper constructs a number of dataflows to reduce the overall energy required by a deep network.
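One plausible reading of this shift-and-MAC mechanism as a sketch (the lane count, shift direction, and psum handling are simplified here, not the WAX paper's exact datapath): the W register holds stationary weights, the A register holds activations, all lanes MAC in parallel and reduce to one output per "cycle", then A shifts by one.

```python
# Shift-and-MAC sketch: stationary W register, shifting A register.
def shift_mac(a_reg, w_reg, n_outputs):
    a = list(a_reg)
    outputs = []
    for _ in range(n_outputs):
        outputs.append(sum(w * x for w, x in zip(w_reg, a)))  # parallel MACs + reduce
        a = a[1:] + [0.0]                                     # A register shifts by one
    return outputs

# 3-tap example: same result as sliding the kernel along the activation row.
print(shift_mac([1, 2, 3, 4, 5], [1, 0, -1], 3))   # [-2, -2, -2]
```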

17. References

- "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," Y.-H. Chen et al., ISCA 2016
- "Wire-Aware Architecture and Dataflow for CNN Accelerators," S. Gudaparthi et al., MICRO 2019