Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de
Outline
- Introduction
- ZYNQ Architecture (Brief)
- Motivations & Contributions
- Infrastructure Setup (Hardware & Software)
- Memory Sharing Methods
- Experimental Results
- Lessons Learned & Conclusion
Introduction
(c) Luca Bedogni 2012
Performance Per Watt!!
- 1951: UNIVAC I: 0.015 operations per 1 watt-second
- 2012 (half a century later!): ST P2012: 40 billion operations per 1 watt-second
Introduction
Solution: specialized functional units (accelerators)
[Figure: CPU, L1$, DRAM and accelerators (specialized hardware). Case 1 and Case 2 show TASK 1 to TASK 4 working on var1, var2 and var3; var1 and var2 are cached.]
- Faster!
- More power efficient!
- Better performance per watt!
What about variables? The CPU should flush the cache!
- The problem can be more complicated, e.g. with multiple CPU cores!
- Every processing element should have a consistent view of the shared memory!
- Accelerator Coherency Port (ACP): allows accelerator hardware to perform coherent accesses to the CPU(s) memory space!
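The reason a flush (or a coherent port) is needed can be sketched with a toy model. This is only an illustration under simplifying assumptions: `WriteBackCache`, `read_bypassing_cache` and `read_coherent` are hypothetical names, a Python dict stands in for DRAM, and the real ACP/SCU behaviour on the ZYNQ is considerably more involved.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict that stands in for DRAM."""

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}      # address -> value currently held in the cache
        self.dirty = set()   # addresses written by the CPU but not yet in DRAM

    def cpu_write(self, addr, value):
        # The CPU write stays in the cache; DRAM is not updated yet.
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Write every dirty line back to DRAM.
        for addr in sorted(self.dirty):
            self.dram[addr] = self.lines[addr]
        self.dirty.clear()


def read_bypassing_cache(dram, addr):
    """Accelerator read that goes straight to DRAM (non-coherent path)."""
    return dram[addr]


def read_coherent(cache, addr):
    """Accelerator read that checks the CPU cache first (coherent path)."""
    if addr in cache.dirty:
        return cache.lines[addr]
    return cache.dram[addr]


dram = {0x1000: 1}
cache = WriteBackCache(dram)
cache.cpu_write(0x1000, 42)                      # CPU updates var1 in its cache

stale = read_bypassing_cache(dram, 0x1000)       # old value, no flush yet
fresh = read_coherent(cache, 0x1000)             # coherent access sees the update
cache.flush()
after_flush = read_bypassing_cache(dram, 0x1000) # correct once the CPU has flushed
```

In this model the non-coherent read returns the old value until the CPU flushes, while the coherent read returns the updated value without any flush.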
Xilinx ZYNQ Architecture
[Block diagram. Labels recoverable from the slide:]
- PS: two ARM A9 cores, each with NEON, MMU and L1; Snoop; L2 (PL310); OCM; DRAM Controller (Synopsys IntelliDDR MPMC); Peripherals (UART, USB, Network, SD, GPIO, ...); Interconnect (ARM NIC-301); DMA Controller (ARM PL330)
- PL: AXI masters, AXI slaves, AXI master
- Ports: HP0, HP1, HP2, HP3, SGP0, SGP1, MGP0, MGP1, ACP
Motivations & Contributions
[Diagram: PS with two ARM A9 cores (NEON, MMU, L1), Snoop, L2 (PL310), OCM and DRAM Controller; an AXI master (accelerator) in the PL, with the ACP and HP0 ports.]
Various acceleration methods are addressed in the literature (GPU, hardware boards, ...).
Which method is better to share data between CPU and accelerator?
For each method:
- What is the data transfer speed?
- How much is the energy consumption?
- What is the effect of background workload on performance?
We develop an infrastructure (HW+SW) for the Xilinx ZYNQ.
We run practical tests and measurements to quantify the efficiency of different CPU-accelerator memory sharing methods.
Hardware
Software
Linux kernel-level drivers:
- AXI Dummy Driver (simple driver):
  - Initializes the dummy AXI masters (HP1)
  - Triggers an endless read/write loop
- AXI Driver (more complicated):
  - Handles the AXI masters on ACP and HP0
  - Memory allocation (over ACP: kmalloc; over HP: dma_alloc_coherent)
  - ISR registration
  - PL310 statistics
  - Time measurement
User side:
- AXI Driver user-side interface application
- Background application: a simple memory read/write loop
- Oprofile statistical profiler: measures all CPU performance metrics
Processing Task Definition
Source image (image_size bytes) @ source address -> read -> FIR (process) -> write -> result image (image_size bytes) @ destination address
- FIFO: 128K, 128K
- Loop: N times; measure the execution interval.
- Selection of packets (addressing): normal or bit-reversed
- Buffers are allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method.
- Image sizes: 4 KBytes, 16K, 65K, 128K, 256K, 1 MByte, 2 MBytes
We define: different methods to accomplish the task.
We measure: execution time and energy.
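The two packet selection schemes can be illustrated with a short sketch. The slides do not spell out the exact scheme, so this assumes that "bit-reversed" means reversing the bits of the packet index over log2(number of packets) bits; `bit_reverse`, `packet_order` and `packet_addresses` are hypothetical helper names.

```python
def bit_reverse(index, width):
    """Reverse the lowest `width` bits of `index`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def packet_order(num_packets, bit_reversed=False):
    """Return the order in which packets are visited.

    Assumption: num_packets is a power of two, so every index has
    exactly log2(num_packets) bits to reverse.
    """
    if num_packets <= 0 or num_packets & (num_packets - 1):
        raise ValueError("num_packets must be a positive power of two")
    if not bit_reversed:
        return list(range(num_packets))
    width = num_packets.bit_length() - 1
    return [bit_reverse(i, width) for i in range(num_packets)]


def packet_addresses(base, packet_bytes, num_packets, bit_reversed=False):
    """Byte address of each packet, in visiting order."""
    return [base + i * packet_bytes
            for i in packet_order(num_packets, bit_reversed)]
```

With 8 packets, normal addressing visits 0, 1, 2, ... in sequence, while bit-reversed addressing visits 0, 4, 2, 6, 1, 5, 3, 7, so consecutive accesses are spread far apart in memory.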
Memory Sharing Methods
- ACP Only: Accelerator -> ACP -> SCU -> L2 -> DRAM (HP Only is similar, but there is no SCU and L2)
- CPU only (with and without cache)
- CPU ACP (CPU HP is similar): Accelerator -> ACP -> SCU -> L2 -> DRAM, together with the CPU (steps 1 and 2 in the diagram)
  Timeline: ACP --- CPU --- ACP ---
Speed Comparison
[Chart: transfer speed for image sizes 4K, 16K, 64K, 128K, 256K and 1 MBytes]
- ACP loses!
- Values marked on the chart: 298 MBytes/s and 239 MBytes/s
- CPU OCM is between CPU ACP and CPU HP
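A speed figure in MBytes/s can be derived from the task definition (image size, N loop iterations, measured execution interval). The sketch below is one plausible way to do that arithmetic, not the authors' measurement code: the function name is hypothetical, it counts image_size bytes per iteration, and whether a MByte means 10^6 or 2^20 bytes is left as a parameter because the slides do not say.

```python
def throughput_mbytes_per_s(image_size_bytes, iterations, elapsed_seconds,
                            bytes_per_mbyte=1_000_000):
    """Average transfer speed over `iterations` passes of the task.

    Assumption: each pass moves image_size_bytes; pass bytes_per_mbyte=2**20
    if binary megabytes are intended.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = image_size_bytes * iterations
    return total_bytes / elapsed_seconds / bytes_per_mbyte
```

For example, `throughput_mbytes_per_s(1_000_000, 10, 0.5)` evaluates to 20.0.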
Dummy Traffic Effect
[Chart, 256K]
- HP: 1382 MBytes/s
- ACP: 1664 MBytes/s
- CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.
Power Comparison
Energy Comparison
- CPU only methods: worst case!
- CPU ACP: always better energy than CPU HP0
- When the image size grows, CPU ACP converges to CPU HP0
- CPU OCM is always between CPU ACP and CPU HP
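Power and energy are compared separately because energy also depends on how long the task runs: energy equals average power multiplied by execution time. A minimal sketch of that relation follows; the function names are hypothetical, and any values passed in are placeholders supplied by the caller, not measurements from these slides.

```python
def energy_joules(average_power_watts, execution_time_seconds):
    """Energy consumed by a task: average power times execution time."""
    return average_power_watts * execution_time_seconds


def rank_by_energy(measurements):
    """Sort method names from lowest to highest energy.

    `measurements` maps a method name to (average_power_watts,
    execution_time_seconds).
    """
    return sorted(measurements,
                  key=lambda name: energy_joules(*measurements[name]))
```

A method that draws more power can still rank better on energy if it finishes sufficiently sooner.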
Lessons Learned & Conclusion
If a specific task should be done by the cooperation of CPU and accelerator:
- CPU ACP and CPU OCM are always better than CPU HP in terms of energy.
- If we are running other applications which heavily depend on caches, CPU OCM and then CPU HP are preferred!
If a specific task should be done by the accelerator only:
- For small arrays, ACP Only and OCM Only can be used.
- For large arrays (> size of L2$), HP Only always performs better.
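These guidelines can be written down as a small decision helper. This is only a restatement of the bullets above as code: `preferred_methods` and its parameters are hypothetical, and the default L2 size of 512 KB is an assumption about the device that should be adjusted for the actual target.

```python
ZYNQ_L2_BYTES = 512 * 1024  # assumption: L2 cache size of the target device


def preferred_methods(cooperative, array_bytes=0,
                      cache_heavy_background=False,
                      l2_bytes=ZYNQ_L2_BYTES):
    """Return the sharing methods suggested by the conclusions, best first.

    cooperative: True if CPU and accelerator work on the task together,
                 False if the accelerator does it alone.
    """
    if cooperative:
        if cache_heavy_background:
            # Other applications depend heavily on the caches.
            return ["CPU OCM", "CPU HP"]
        # Best in terms of energy.
        return ["CPU ACP", "CPU OCM"]
    if array_bytes > l2_bytes:
        # Array larger than L2: go through the HP port.
        return ["HP Only"]
    return ["ACP Only", "OCM Only"]
```

For instance, an accelerator-only task on a 2 MByte image maps to HP Only, while a 4 KByte image maps to ACP Only or OCM Only.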