Uploaded On 2017-03-17

Presentation Transcript

Slide1

Overcoming Resource Underutilization in Spatial CNN Accelerators

Yongming Shen, Michael Ferdman, Peter Milder

COMPAS Lab, Stony Brook University

Slide2

CNN on FPGAs

Convolutional Neural Networks (CNNs)
- Best known method for object recognition [Simonyan, ICLR2015]
- Massive amount of computation
- Highly parallel, amenable to hardware acceleration

FPGA-based CNN Acceleration
- Lots of compute units (DSP slices) → Exploit parallelism
- Reconfigurable fabric → CNN-tailored caches and dataflow
- Energy efficient

CNNs are highly parallel and amenable to FPGA acceleration

[Figure: Conv. Layer 0, Conv. Layer 1, Conv. Layer 2; output label "DOG"]

Slide3

The Underutilization Problem

State of the art: whole FPGA as one processor [Zhang, FPGA2015]
- Single-CLP (Convolutional Layer Processor)
- Dimension mismatch causes dynamic underutilization
- Part of the compute array becomes idle
- Only 74% average utilization with AlexNet on Virtex-7 485T
- Less than 30% average utilization on larger FPGAs (~10K DSP slices)

“One size fits all” → low dynamic arithmetic-unit utilization

[Figure: Conv. Layer 0, Conv. Layer 1, and Conv. Layer 2, with utilization labels 45%, 100%, 100%, and 40%]
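The dimension mismatch can be illustrated with a small model. This is only a sketch under stated assumptions: it assumes the compute array is unrolled over `tn` input feature maps and `tm` output feature maps, which the slides do not spell out. The function name, the array shape, and the layer shapes are all hypothetical and chosen for illustration, not taken from the presentation.

```python
from math import ceil


def layer_utilization(n_in, m_out, tn, tm):
    """Fraction of a tn x tm multiplier array doing useful work on one layer.

    Assumed model (not from the slides): each step handles up to tn input
    feature maps and tm output feature maps, so a layer whose dimensions are
    not multiples of tn and tm leaves part of the array idle.
    """
    steps = ceil(n_in / tn) * ceil(m_out / tm)  # array passes needed
    useful = n_in * m_out                       # map pairs that carry real work
    return useful / (steps * tn * tm)


# Hypothetical array shape and layer shapes (input maps, output maps).
tn, tm = 8, 64
layers = [(3, 48), (48, 128), (256, 192)]

for n_in, m_out in layers:
    util = layer_utilization(n_in, m_out, tn, tm)
    print(f"N={n_in:3d} M={m_out:3d} util={util:.0%}")
```

In this toy setup the first layer fills only a small part of the array because it has far fewer input and output maps than the array expects, while the other two layers divide evenly and keep every unit busy. A single array shape cannot fit all layers at once, which is the "one size fits all" effect described above.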

[Figure: FPGA containing a single CLP]

Slide4

Our Multi-CLP Solution

Single large CLP → Multiple smaller CLPs
- Total DSP-slice consumption remains the same
- Each CLP optimized for a subset of layers
- Multiple images in flight at the same time
- Avoids data dependency between CLPs
- AlexNet on Virtex-7 485T (simulation): 74% util. → 97% util., 1.3x speedup
- AlexNet on FPGAs with ~10K DSP slices (model): 30% util. → 96% util., 3.2x speedup

Specialization → ~100% dynamic arithmetic-unit utilization

[Figure: FPGA with a single CLP; FPGA with CLP0, CLP1, and CLP2]
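The reported speedups are consistent with the utilization figures if throughput scales linearly with dynamic utilization. The check below is a sketch under that assumption: it relies on the total DSP-slice count staying the same, as the slide states, and additionally assumes an unchanged clock rate and no memory-bandwidth limit, which the slides do not confirm. The function name is hypothetical.

```python
def speedup_from_utilization(util_before, util_after):
    """Speedup when only utilization changes.

    Assumes the same number of arithmetic units and the same clock rate, so
    useful work per cycle is proportional to utilization.
    """
    return util_after / util_before


# Utilization figures quoted on the slide.
print(round(speedup_from_utilization(0.74, 0.97), 2))  # Virtex-7 485T
print(round(speedup_from_utilization(0.30, 0.96), 2))  # ~10K DSP slices
```

The two ratios come out at roughly 1.3 and 3.2, matching the speedups quoted for the two AlexNet configurations.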