Presentation Transcript

Slide 1

Integration for Heterogeneous SoC Modeling

Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks (Harvard University)

Slide 2

Today’s Accelerator-CPU Integration

- Simple interface to accelerators: DMA
- Easy to integrate lots of IP
- Hard to program and share data

[Figure: cores with L1 and L2 caches and accelerators (Acc #1 ... Acc #n, each with a SPAD) connected through a DMA engine and the on-chip system bus to DRAM]

Slide 3

Today’s Accelerator-CPU Integration

(Same content as Slide 2.)

Slide 4

Typical DMA Flow

1. Flush and invalidate input data from CPU caches.
2. Invalidate a region of memory to be used for receiving accelerator output.
3. Program a buffer descriptor describing the transfer (start, length, source, destination). When the data is large, program multiple descriptors.
4. Initiate the accelerator.
5. Initiate the data transfer.
6. Wait for the accelerator to complete.

A minimal code sketch of this flow follows.
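Below is a minimal C sketch of the flow above. All helper names (flush_cache_range, program_descriptor, accel_spad_base, and so on) are hypothetical stand-ins for platform-specific driver calls, not gem5-Aladdin functions.

/* Hypothetical sketch of the DMA flow; every helper here is an
 * invented stand-in for a platform-specific driver call. */
#include <stddef.h>

typedef struct {
    void  *src;   /* source address */
    void  *dst;   /* destination address */
    size_t len;   /* transfer length in bytes */
} dma_desc_t;

extern void  flush_cache_range(void *addr, size_t len);
extern void  invalidate_cache_range(void *addr, size_t len);
extern void *accel_spad_base(void);
extern void  program_descriptor(const dma_desc_t *desc);
extern void  start_accelerator(void);
extern void  start_dma(void);
extern int   accelerator_done(void);

void offload(void *in, size_t in_len, void *out, size_t out_len) {
    flush_cache_range(in, in_len);         /* 1. flush input from CPU caches  */
    invalidate_cache_range(out, out_len);  /* 2. invalidate the output region */
    dma_desc_t desc = { in, accel_spad_base(), in_len };
    program_descriptor(&desc);             /* 3. one descriptor per transfer;
                                              large data needs several        */
    start_accelerator();                   /* 4. initiate the accelerator     */
    start_dma();                           /* 5. initiate the data transfer   */
    while (!accelerator_done())            /* 6. wait for completion          */
        ;
}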

Slide 5

DMA can be very expensive

16-way parallel md-knn accelerator: only 20% of total time! [Figure: runtime breakdown]

Slide 6

Co-Design vs. Isolated Design

Slide 7

Co-Design vs. Isolated Design

Slide 8

Co-Design vs. Isolated Design

No need to build such an aggressively parallel design!

Slide 9

gem5-Aladdin: An SoC Simulator

Slide 10

Features

- End-to-end simulation of accelerated workloads.
- Models hardware-managed caches and DMA + scratchpad memory systems.
- Supports multiple accelerators.
- Enables system-level studies of accelerator-centric platforms.
- Xenon: a powerful design sweep system.
- Highly configurable and extensible.

Slide 11

DMA Engine

- Extends the existing DMA engine in gem5 to accelerators.
- Special dmaLoad and dmaStore functions: insert them into the accelerated kernel; the trace will capture them; gem5-Aladdin will handle them.
- Currently a timing model only.
- Analytical model for cache flush and invalidation latency.

Slide 12

DMA Engine

/* Code representing the accelerator */
void fft1D_512(TYPE work_x[512], TYPE work_y[512]) {
    int tid, hi, lo, stride;
    /* more setup */
    dmaLoad(&work_x[0], 0, 512 * sizeof(TYPE));
    dmaLoad(&work_y[0], 0, 512 * sizeof(TYPE));

    /* Run FFT here ... */

    dmaStore(&work_x[0], 0, 512 * sizeof(TYPE));
    dmaStore(&work_y[0], 0, 512 * sizeof(TYPE));
}

Slide 13

Caches and Virtual Memory

- Gaining traction on multiple platforms:
  - Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
  - IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
- System vendors provide a host service layer with virtual memory and cache coherence support.
- The host service layer communicates with the CPUs through an agent.

[Figure: processors (cores with L1 and L2 caches, plus an agent) connected over QPI/PCIe to an FPGA hosting the accelerator behind a host service layer]

Slide 14

Caches and Virtual Memory

- Accelerator caches are connected directly to the system bus.
- Support for multi-level cache hierarchies.
- Hybrid memory system: can use both caches and scratchpads.
- Basic MOESI coherence protocol.
- Special Aladdin TLB model: maps the trace address space to the simulated address space.

Slide 15

Demo: DMA

Exercise: change the system bus width and see the effect on accelerator performance.

Open up your VM and go to:
~/gem5-aladdin/sweeps/tutorial/dma/stencil-stencil2d/0

Examine these files:
- stencil-stencil2d.cfg
- ../inputs/dynamic_trace.gz
- gem5.cfg
- run.sh

Slide 16

Demo: DMA

1. Run the accelerator with the DMA simulation.
2. Change the system bus width to 32 bits: set xbar_width=4 in run.sh.
3. Run again.
4. Compare results.

Slide 17

Demo: Caches

Exercise: see the effect of cache size on accelerator performance.

Go to:
~/gem5-aladdin/sweeps/tutorial/cache/stencil-stencil2d/0

Examine these files:
- ../inputs/dynamic_trace.gz
- stencil-stencil2d.cfg
- gem5.cfg

Slide 18

Demo: Caches

1. Run the accelerator with the cache simulation.
2. Change the cache size to 1 kB: set cache_size = 1kB in gem5.cfg.
3. Run again.
4. Compare results.
5. Play with some other parameters (associativity, line size, etc.).

Slide 19

CPU-Accelerator Cosimulation

- The CPU can invoke an attached accelerator; we use the ioctl system call.
- Status is communicated through shared memory.
- Spin-wait for the accelerator, or do something else (e.g., start another accelerator).

A sketch of the underlying mechanism appears below; the actual helper API follows on the next slides.
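The following is a hypothetical C sketch of the invocation mechanism described above. The device path, request code, and flag name are invented for illustration; gem5-Aladdin wraps this pattern in helpers such as mapArrayToAccelerator and invokeAcceleratorAndBlock, shown on the next slides.

/* Hypothetical sketch: invoke an accelerator via ioctl and spin-wait
 * on a shared-memory flag. All names here are invented. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ACCEL_START 0x100            /* made-up ioctl request code */

static volatile int accel_done;      /* completion flag, set by the device */

void invoke_and_wait(void) {
    int fd = open("/dev/accel0", O_RDWR);  /* hypothetical device node */
    accel_done = 0;
    /* Pass the flag's address so the device can signal completion. */
    ioctl(fd, ACCEL_START, &accel_done);
    while (!accel_done)
        ;  /* spin-wait, or start another accelerator here instead */
    close(fd);
}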

Slide 20

Code example

/* Code running on the CPU. */
void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
}

Slide 21

Code example

/* Code running on the CPU. */
void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
    /* Establish a mapping from simulated to trace
     * address space */
    mapArrayToAccelerator(
        MACHSUITE_FFT_TRANSPOSE,  /* ioctl request code */
        "work_x",                 /* associate this array name with the addresses
                                     of memory accesses in the trace */
        work_x,                   /* starting address of one memory region that
                                     the accelerator can access */
        512 * sizeof(TYPE));      /* length of that region; sizeof(work_x) on an
                                     array parameter would only give pointer size */
}

Slide 22

Code example

/* Code running on the CPU. */
void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
    /* Establish a mapping from simulated to trace
     * address space */
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_x",
                          work_x, 512 * sizeof(TYPE));
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_y",
                          work_y, 512 * sizeof(TYPE));
}

Slide 23

Code example

/* Code running on the CPU. */
void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
    /* Establish a mapping from simulated to trace
     * address space */
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_x",
                          work_x, 512 * sizeof(TYPE));
    mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_y",
                          work_y, 512 * sizeof(TYPE));

    /* Start the accelerator and spin until it finishes. */
    invokeAcceleratorAndBlock(MACHSUITE_FFT_TRANSPOSE);
}

Slide 24

Demo: disparity

You can just watch for this one. If you want to follow along:
~/gem5-aladdin/sweeps/tutorial/cortexsuite_sweep/0

This is a multi-kernel, CPU + accelerator cosimulation.

Slide 25

How can I use gem5-Aladdin?

- Investigate optimizations to the DMA flow.
- Study cache-based accelerators.
- Study the impact of system-level effects on accelerator design.
- Multi-accelerator systems.
- Near-data processing.

All of these will require design sweeps!

Slide 26

Xenon: Design Sweep System

- A small declarative command language for generating design sweep configurations.
- Implemented as a Python-embedded DSL.
- Highly extensible: not gem5-Aladdin specific, and not limited to sweeping parameters on benchmarks.

Why “Xenon”?

Slide 27

Xenon: Generation Procedure

Slide 28

Xenon: Data Structures

[Figure: an example of a Python data structure that Xenon operates on, with attributes cycle_time, pipelining, unrolling, partition_type, partition_factor, and memory_type. A benchmark suite would contain many of these.]

Slide 29

Xenon: Commands

set unrolling 4
set partition_type "cyclic"
set unrolling for md_knn.* 8
set partition_type for md_knn.force_x "block"
sweep cycle_time from 1 to 5
sweep partition_factor from 1 to 8 expstep 2
set partition_factor for md_knn.force_x 8
generate configs
generate trace

Slide 30

Xenon: Execute

"Benchmark(\"md-knn\")": {
    "Array(\"NL\")": {
        "memory_type": "cache",
        "name": "NL",
        "partition_factor": 1,
        "partition_type": "cyclic",
        "size": 4096,
        "type": "Array",
        "word_length": 8
    },
    "Array(\"force_x\")": {
        "memory_type": "cache",
        "name": "force_x",
        "partition_factor": 1,
        "partition_type": "cyclic",
        "size": 256,
        "type": "Array",
        "word_length": 8
    },
    "Array(\"force_y\")": {
        "memory_type": "cache",
        "name": "force_y",
        "partition_factor": 1,
        "partition_type": "cyclic",
        "size": 256,
        "type": "Array",
        "word_length": 8
    },
    ...

Every configuration is written to a JSON file. A backend is then invoked to load this JSON object and write application-specific config files.

Slide 31

Enough talking… let’s do a demo

Slide 32

Demo: Design Sweeps with Xenon

Exercise: sweep some parameters.

Go to:
~/gem5-aladdin/sweeps/tutorial/

Examine these files:
- ../inputs/dynamic_trace.gz
- stencil-stencil2d.cfg
- gem5.cfg

Slide 33

Tutorial References

- Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin,” MICRO, 2016.
- Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW, 2015.
- Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS, 2013.
- B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED, 2013.
- Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA, 2014.
- B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC, 2014.

Slide 34

Backup slides

Slide 35

Validation

- Implemented accelerators in Vivado HLS.
- Designed the complete system in Vivado Design Suite 2015.1.

Slide 36

Reducing DMA Overhead

Slide 37

Reducing DMA Overhead

Slide 38

Reducing DMA Overhead

Slide 39

DMA Optimization Results

Overlap of flush and data transfer.

Slide 40

DMA Optimization Results

Overlap of data transfer and compute.

Slide 41

DMA Optimization Results

md-knn is able to completely overlap compute with data transfer.

Slide 42

DMA vs. Caches

DMA:
- Push-based data access
- Bulk transfer efficiency
- Simple hardware, lower power
- Manual coherence management
- Manual optimizations
- Coarse-grained communication

Caches:
- On-demand data access
- Automatic coherence handling
- Fine-grained communication
- Automatic eviction
- Larger, higher power cost
- Less efficient at bulk data movement

Slide 43

DMA vs. Caches

DMA is faster and lower power than caches:
- Regular access patterns
- Small input data size
- Caches will have cold misses

Slide 44

DMA vs. Caches

DMA and caches are approximately equal: power is dominated by the floating-point units.

Slide 45

DMA vs. Caches

DMA is faster and lower power than caches:
- Regular access patterns
- Small input data size
- Caches will have cold misses