Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ

Uploaded On 2018-02-15



Presentation Transcript

Slide1

Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ

Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini

Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy

Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany

{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de

Slide2

Outline

Introduction

ZYNQ Architecture (Brief)

Motivations & Contributions

Infrastructure Setup (Hardware & Software)

Memory Sharing Methods

Experimental Results

Lessons Learned & Conclusion

Slide3

(c) Luca Bedogni 2012

Introduction

Performance Per Watt!!

1951: UNIVAC I: 0.015 operations per watt-second

2012 (half a century later!): ST P2012: 40 billion operations per watt-second

Slide4

Introduction

Solution: specialized functional units (accelerators)

[Diagram: Case 1 and Case 2. Labels: CPU, L1$, DRAM, two accelerators (specialized hardware), TASK 1 to TASK 4, variables var1, var2 and var3, with var1 and var2 marked as cached.]

Faster! More power efficient! Better performance per watt!

What about variables? The CPU should flush the cache!

- The problem can be more complicated, e.g. multiple CPU cores!

- Every processing element should have a consistent view of the shared memory!

- Accelerator Coherency Port (ACP): allows accelerator hardware to perform coherent accesses to the CPU(s) memory space!

Slide5
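The coherency problem raised in the introduction (a variable cached by the CPU is stale in DRAM until the cache is flushed) can be illustrated with a toy model. This is only a sketch: `ToySystem` and its method names are hypothetical, and it models neither the real Zynq cache hierarchy nor the actual ACP protocol.

```python
class ToySystem:
    """Toy model: DRAM plus a write-back CPU cache (not the real Zynq hardware)."""

    def __init__(self, dram):
        self.dram = dict(dram)  # backing memory
        self.cache = {}         # dirty values held only in the CPU cache

    def cpu_write(self, name, value):
        # Write-back policy: the new value stays in the cache until flushed.
        self.cache[name] = value

    def cpu_flush(self):
        # Software-managed coherency: push dirty values out to DRAM.
        self.dram.update(self.cache)
        self.cache.clear()

    def accel_read_noncoherent(self, name):
        # Non-coherent path: bypasses the CPU cache and sees only DRAM.
        return self.dram[name]

    def accel_read_coherent(self, name):
        # Coherent path (ACP-like): the CPU cache is checked first.
        if name in self.cache:
            return self.cache[name]
        return self.dram[name]


system = ToySystem({"var1": 0})
system.cpu_write("var1", 42)
stale = system.accel_read_noncoherent("var1")   # old value from DRAM
fresh = system.accel_read_coherent("var1")      # value from the CPU cache
system.cpu_flush()
after_flush = system.accel_read_noncoherent("var1")
print(stale, fresh, after_flush)
```

In this model the non-coherent read only becomes correct after the explicit flush, while the coherent read is correct without any software action.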

Xilinx ZYNQ Architecture

[Block diagram. Labels: PS, PL, OCM, two ARM A9 cores (each with NEON, MMU and L1), Snoop, L2 (PL310), DRAM Controller (Synopsys IntelliDDR MPMC), Peripherals (UART, USB, Network, SD, GPIO, ...), Interconnect (ARM NIC-301), DMA Controller (ARM PL330), HP0, HP1, HP2, HP3, SGP0, SGP1, MGP0, MGP1, ACP, AXI Masters, AXI Slaves, AXI Master.]

Slide6

Motivations & Contributions

[Diagram labels: PS, PL, OCM, DRAM Controller, HP0, AXI Master (Accelerator), ACP, L2 (PL310), two ARM A9 cores (each with NEON, MMU and L1), Snoop.]

Various acceleration methods are addressed in the literature (GPU, hardware boards, ...).

Which method is better for sharing data between the CPU and the accelerator?

For each method:

- What is the data transfer speed?

- How much is the energy consumption?

- What is the effect of background workload on performance?

We develop an infrastructure (HW+SW) for the Xilinx ZYNQ.

We run practical tests and measurements to quantify the efficiency of different CPU-accelerator memory sharing methods.

Slide7

Hardware

Slide8

Software

Linux kernel-level drivers:

- AXI Dummy Driver (simple driver): initializes the dummy AXI masters (HP1); triggers an endless read/write loop.

- AXI Driver (more complicated): handles the AXI masters on ACP and HP0; memory allocation; ISR registration; statistics; PL310; time measurement.

Memory allocation: over ACP, kmalloc; over HP, dma_alloc_coherent.

User side:

- AXI Driver user-side interface application.

- Background application: a simple memory read/write loop.

- Oprofile statistical profiler: measures all CPU performance metrics.

Slide9

Processing Task Definition

Loop, N times: read the source image (image_size bytes, at the source address), process it with the FIR, and write the result image (image_size bytes, at the destination address). Measure the execution interval.

FIFO: 128K, 128K

Selection of packets (addressing): normal or bit-reversed.

Buffers are allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method.

Image sizes: 4 KBytes, 16K, 65K, 128K, 256K, 1 MByte, 2 MBytes.

We define: different methods to accomplish the task.

We measure: execution time and energy.

Slide10
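The processing task defined above (read, FIR, write, repeated N times, with normal or bit-reversed packet selection) can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' hardware or driver code: the packet size, the filter taps `(1, 2, 1)`, the per-packet filtering without carried state, and all function names are hypothetical, and "bit-reversed" is interpreted here as visiting packets in bit-reversed index order.

```python
import time


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (e.g. 0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def packet_order(num_packets, bit_reversed):
    """Packet visiting order; num_packets must be a power of two."""
    if num_packets < 1 or num_packets & (num_packets - 1):
        raise ValueError("num_packets must be a power of two")
    if not bit_reversed:
        return list(range(num_packets))
    bits = num_packets.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(num_packets)]


def fir(samples, taps=(1, 2, 1)):
    """Causal integer FIR over one packet, normalized by the tap sum (assumed taps)."""
    norm = sum(taps)
    out = bytearray(len(samples))
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out[n] = acc // norm
    return bytes(out)


def run_task(source, packet_size, n_loops, bit_reversed):
    """Read, filter and write every packet, n_loops times; return (result, seconds)."""
    if len(source) % packet_size:
        raise ValueError("image size must be a multiple of packet_size")
    order = packet_order(len(source) // packet_size, bit_reversed)
    dest = bytearray(len(source))
    start = time.perf_counter()
    for _ in range(n_loops):
        for p in order:
            lo = p * packet_size
            dest[lo:lo + packet_size] = fir(source[lo:lo + packet_size])
    return bytes(dest), time.perf_counter() - start


image = bytes(range(256)) * 16  # 4 KByte synthetic test pattern
n_loops = 4
result, seconds = run_task(image, packet_size=512, n_loops=n_loops, bit_reversed=True)
print(f"{len(image) * n_loops / seconds / 1e6:.2f} MBytes/s (pure-Python model)")
```

The throughput printed here only reflects the Python interpreter; the point of the sketch is the structure of the measurement (bytes processed over N loops divided by the measured interval), which mirrors how a transfer speed in MBytes/s would be derived.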

Memory Sharing Methods

- ACP Only (HP Only is similar, but there is no SCU and L2). [Diagram: Accelerator, ACP, SCU, L2, DRAM.]

- CPU only (with and without cache).

- CPU ACP (CPU HP is similar). [Diagram: Accelerator, ACP, SCU, L2, DRAM, CPU, with steps 1 and 2; timeline: ACP, CPU, ACP.]

Slide11

Speed Comparison

[Chart: transfer speed for image sizes 4K, 16K, 64K, 128K, 256K and 1 MByte. Annotations: "ACP Loses!", 298 MBytes/s, 239 MBytes/s.]

CPU OCM is between CPU ACP and CPU HP.

Slide12

Dummy Traffic Effect

[Chart, image size 256K. HP: 1382 MBytes/s. ACP: 1664 MBytes/s.]

CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.

Slide13

Power Comparison

Slide14

Energy Comparison

CPU only methods: worst case!

CPU ACP: always better energy than CPU HP0.

When the image size grows, CPU ACP converges to CPU HP0.

CPU OCM is always between CPU ACP and CPU HP.

Slide15
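The energy comparison combines the two quantities measured earlier, power and execution time. Assuming energy is taken as average power multiplied by execution time (the slides do not spell out the formula), a method that draws more power can still consume less energy if it finishes sooner. The numbers below are placeholders chosen for illustration, not measurements from the presentation, and `energy_joules` is a hypothetical helper.

```python
def energy_joules(avg_power_watts, exec_time_s):
    """Energy = average power x execution time."""
    return avg_power_watts * exec_time_s


# Placeholder values, NOT results from the slides.
slow_low_power = energy_joules(avg_power_watts=1.0, exec_time_s=4.0)
fast_high_power = energy_joules(avg_power_watts=1.5, exec_time_s=2.0)
print(slow_low_power, fast_high_power)
```

With these placeholder inputs the faster method uses less energy despite its higher power draw, which is why power and energy are compared separately.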

Lessons Learned & Conclusion

If a specific task should be done by the cooperation of CPU and accelerator:

- CPU ACP and CPU OCM are always better than CPU HP in terms of energy.

- If we are running other applications which heavily depend on caches, CPU OCM and then CPU HP are preferred!

If a specific task should be done by the accelerator only:

- For small arrays, ACP Only and OCM Only can be used.

- For large arrays (> size of L2$), HP Only always performs better.
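The conclusions above can be restated as a small decision helper. This is only a sketch of the stated rules: the function name, its parameters and the returned labels are hypothetical, and the default L2 size of 512 KB is an assumption about the Zynq-7000 device that the slides do not state.

```python
ASSUMED_L2_BYTES = 512 * 1024  # assumption: L2 cache size, not given in the slides


def recommend_methods(cpu_cooperates, cache_heavy_background=False,
                      array_bytes=0, l2_bytes=ASSUMED_L2_BYTES):
    """Return the sharing methods the conclusions favor, in order of preference."""
    if cpu_cooperates:
        if cache_heavy_background:
            # Other applications depend heavily on the caches.
            return ["CPU OCM", "CPU HP"]
        # Better than CPU HP in terms of energy.
        return ["CPU ACP", "CPU OCM"]
    if array_bytes > l2_bytes:
        # Large arrays: larger than the L2 cache.
        return ["HP Only"]
    # Small arrays.
    return ["ACP Only", "OCM Only"]


print(recommend_methods(cpu_cooperates=True))
print(recommend_methods(cpu_cooperates=True, cache_heavy_background=True))
print(recommend_methods(cpu_cooperates=False, array_bytes=64 * 1024))
print(recommend_methods(cpu_cooperates=False, array_bytes=2 * 1024 * 1024))
```

The threshold is exposed as a parameter because the slides only say "size of L2$"; callers should pass the value for their own device.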