Polly-ACC: Transparent Compilation to Heterogeneous Hardware


Presentation Transcript

Slide 1

Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler
LLVM Workshop @ CGO'17, February 4, 2017 | Austin, TX

Johannes Doerfert (University of Saarbruecken)
Michael Kruse, Albert Cohen, Sven Verdoolaege (Polly Labs)
Yabin Hu (China University of Geosciences)
… many others

Swiss Universities / PASC
ARM, Qualcomm, Xilinx

Slide 2

row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
S0: *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
S1:     temp1 = *input_image_offset++;
S1:     temp2 = *kernel_offset++;
S1:     *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
S2: *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}
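The original uses manually advanced pointers. The same computation restated with explicit subscripts (an illustrative rewrite of mine, not from the slides) makes the affine array accesses that Polly's polyhedral model reasons about easier to see:

    /* Equivalent convolution written with explicit affine subscripts. */
    for (r = 0; r < NN - KK + 1; r++)
      for (c = 0; c < NN - KK + 1; c++) {
        output_image[(r + dead_rows) * NN + (c + dead_cols)] = 0;               /* S0 */
        for (i = 0; i < KK; i++)
          for (j = 0; j < KK; j++)
            output_image[(r + dead_rows) * NN + (c + dead_cols)] +=             /* S1 */
                input_image[(r + i) * NN + (c + j)] * kernel[i * KK + j];
        output_image[(r + dead_rows) * NN + (c + dead_cols)] /= normal_factor;  /* S2 */
      }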

[Figure: sequential Fortran/C/C++ software on one side; parallel hardware on the other: multi-core CPUs, many-core GPUs, and other accelerators]

Development
Maintenance
Performance Tuning

Slide 3
Design Goals

Automatic, "Regression Free", High Performance
Non-Goal: Algorithmic Changes

Automatic Accelerator Mapping: how close can we get?

Slide 4
Polly-ACC: Architecture

Polly-ACC: Transparent Compilation to Heterogeneous Hardware, Tobias Grosser and Torsten Hoefler, at the International Conference on Supercomputing (ICS), June 2016, Istanbul

GSoC '11: Justin Holewinski (OSU, today NVIDIA)
GSoC '16: Matthias Reisinger (TU Wien)
GSoC '10: Hongbin Zheng (Xilinx)
GSoC '12/'14: Yabin Hu

Slide 5
How to get Polly-ACC

cd <llvm-src>
cd tools; git clone llvm.org/polly.git
cd <llvm-build>; make

clang -O3 -mllvm -polly -mllvm -polly-target=gpu -lGPURuntime -ldl -o cpu-gpu-hybrid-prog file.c
./cpu-gpu-hybrid-prog
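file.c above stands for any ordinary C source. As a rough illustration, a file like the following (a hypothetical example of mine, not from the slides) contains the kind of affine loop nest Polly-ACC can model and, when deemed profitable, offload to the GPU while the rest of the program runs on the CPU:

    /* file.c (illustrative): dense matrix multiplication, a classic affine
       loop nest that the polyhedral model represents exactly. */
    #define N 1024

    void matmul(float A[N][N], float B[N][N], float C[N][N]) {
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
          C[i][j] = 0.0f;
          for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
        }
    }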

Slide 6
Polyhedral Loop Modeling

Iteration Space (N = 4)
Constraints: 0 ≤ i, i ≤ N = 4, 0 ≤ j, j ≤ i

D = { (i, j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

Program Code

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i,j);

[Figure: each statement instance S(i, j) in D plotted as one point of the triangular iteration space]
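To make the set concrete, here is a small sketch (mine, not from the slides) that enumerates D for N = 4; it visits the same 15 statement instances the figure shows as points:

    /* Enumerate D = { (i, j) | 0 <= i <= N and 0 <= j <= i } for N = 4. */
    #include <stdio.h>

    int main(void) {
      const int N = 4;
      int count = 0;
      for (int i = 0; i <= N; i++)
        for (int j = 0; j <= i; j++) {
          printf("S(%d, %d)\n", i, j);
          count++;
        }
      printf("%d instances\n", count);  /* 1 + 2 + 3 + 4 + 5 = 15 */
      return 0;
    }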

Automatic

Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser, Armin Groesslinger, and Christian Lengauer, in Parallel Processing Letters (PPL), April 2012

Slide 7
Mapping Computation to Device
[-mllvm -polly-target=gpu]

[Figure: the iteration space (dimensions i and j) is tiled and mapped onto device blocks and threads]

Automatic
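As a rough sketch of what such a mapping looks like (my illustration, not the code Polly-ACC actually emits), a CUDA kernel can recover the loop indices from block and thread IDs; the kernel name and tile sizes below are invented:

    /* Each thread executes one instance S(i, j) of a rectangular iteration space. */
    __global__ void S_kernel(int ni, int nj /*, data pointers */) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < ni && j < nj) {
        /* body of S(i, j) */
      }
    }

    /* Host-side launch covering the space with 32x8 thread blocks:
         dim3 block(32, 8);
         dim3 grid((nj + 31) / 32, (ni + 7) / 8);
         S_kernel<<<grid, block>>>(ni, nj);                          */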

Slide 8
Memory Hierarchy of a Heterogeneous System

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013

Slide 9
Host-Device Data Transfers

Automatic

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013

Slide 10
Host-Device Data Transfers
[-lGPURuntime -ldl]

Automatic

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
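Conceptually (a sketch of mine using the CUDA runtime API for illustration; Polly-ACC's GPURuntime library may do this differently), the generated host code wraps each offloaded kernel in the necessary allocations and copies:

    #include <cuda_runtime.h>

    /* Transfers inserted around one offloaded kernel; names and sizes are placeholders. */
    void run_offloaded(float *A, int n) {
      float *dev_A;
      size_t bytes = n * sizeof(float);
      cudaMalloc((void **)&dev_A, bytes);
      cudaMemcpy(dev_A, A, bytes, cudaMemcpyHostToDevice);   /* host -> device */
      /* kernel<<<grid, block>>>(dev_A, n);  -- the offloaded loop nest */
      cudaMemcpy(A, dev_A, bytes, cudaMemcpyDeviceToHost);   /* device -> host */
      cudaFree(dev_A);
    }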

Slide 11
Mapping to Fast Memory
[-mllvm -polly-acc-use-shared]

High-Performance

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013

Slide 12
Mapping to Fast Memory
[-mllvm -polly-acc-use-private]

High-Performance

Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
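For intuition (an invented sketch, not Polly-ACC's generated code): with -polly-acc-use-shared a reused tile of an array is staged into on-chip shared memory, while -polly-acc-use-private keeps thread-local values in registers:

    #define TILE 32

    /* Stage a tile into shared memory and accumulate into a per-thread register. */
    __global__ void stage_tile(const float *A, float *out, int n) {
      __shared__ float tile[TILE];            /* fast memory shared by the block */
      int t = threadIdx.x;
      int base = blockIdx.x * TILE;
      if (base + t < n)
        tile[t] = A[base + t];                /* one global load per thread */
      __syncthreads();

      float acc = 0.0f;                       /* "private" value, lives in a register */
      for (int k = 0; k < TILE && base + k < n; k++)
        acc += tile[k];                       /* all threads reuse the staged tile */
      if (base + t < n)
        out[base + t] = acc;
    }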

Slide 13
Accessed Data (for a 2x2 thread block)
[-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];

Optimistic Delinearization of Parametrically Sized Arrays, Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015


Slide 16
Accessed Data (for a 2x2 thread block)
[-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];

Data needed on device: 12 elements. Minimal data, but complex transfer.

Optimistic Delinearization of Parametrically Sized Arrays, Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015

Slide 17
Accessed Data (for a 2x2 thread block)
[-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];

One-dimensional hull: 20 elements. Simple transfer, but redundant data.

Optimistic Delinearization of Parametrically Sized Arrays, Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015

Slide 18
Accessed Data (for a 2x2 thread block)
[-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    … = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];

Two-dimensional hull: 16 elements. Simple transfer, less redundant data.

Modeling multi-dimensional access behavior is important.

See also "Optimistic Loop Optimization" at CGO'17, with Johannes Doerfert and Sebastian Hack.

Optimistic Delinearization of Parametrically Sized Arrays, Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015
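Where the three element counts come from (my reconstruction, assuming the 2x2 thread block covers i, j in {1, 2} and that A has rows of 6 elements, which is consistent with the 20-element figure):

- Exact footprint: the five-point stencil touches A[0..3][1..2] (8 elements) plus A[1..2][0] and A[1..2][3] (4 elements), i.e. 12 elements, but the region is cross-shaped and awkward to transfer.
- Two-dimensional hull: the bounding box A[0..3][0..3] contains 4 x 4 = 16 elements.
- One-dimensional hull: linearized with 6-element rows, the accessed addresses range from offset 1 (A[0][1]) to offset 20 (A[3][2]), i.e. 20 contiguous elements.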

Slide 19
Access Coalescing

[Figure: array elements a through i laid out linearly in DRAM]

High-Performance

Slide 20
Access Coalescing (no permute)

[Figure: elements a through i copied from DRAM into shared memory in their original order]

High-Performance

Slide 21
Access Coalescing (permuted)

[Figure: elements a through i copied from DRAM into shared memory in the permuted order a, d, g, b, e, h, c, f, i (a transposed layout)]

High-Performance
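A rough CUDA illustration of the idea (not the code Polly-ACC generates): threads of a block load a tile so that consecutive threads touch consecutive DRAM addresses (coalesced), even though the compute phase later reads the tile in a permuted, here transposed, order from shared memory:

    #define TILE 32

    /* Coalesced load of a TILE x TILE block, transposed read from shared memory.
       Assumes blockDim = (TILE, TILE) and n is a multiple of TILE. */
    __global__ void transpose_tile(const float *in, float *out, int n) {
      __shared__ float tile[TILE][TILE];
      int x = blockIdx.x * TILE + threadIdx.x;            /* column in `in`  */
      int y = blockIdx.y * TILE + threadIdx.y;            /* row in `in`     */
      tile[threadIdx.y][threadIdx.x] = in[y * n + x];     /* coalesced: threadIdx.x varies fastest */
      __syncthreads();

      int tx = blockIdx.y * TILE + threadIdx.x;           /* column in `out` */
      int ty = blockIdx.x * TILE + threadIdx.y;           /* row in `out`    */
      out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  /* permuted read, coalesced write */
    }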

Slide 22
Profitability Heuristic
[-mllvm -polly-process-unprofitable=false]

[Figure: all loop nests are filtered for trivial, unsuitable, and insufficient-compute cases, using both static modeling and dynamic (execution-time) checks, before being mapped to the GPU]

Regression-Free
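One way to picture the dynamic part (a conceptual sketch of mine, not Polly-ACC's actual generated control flow): the program keeps the original CPU version and only takes the GPU path when a runtime check suggests there is enough work to amortize the offloading cost, so small problem sizes cannot regress:

    /* Hypothetical runtime profitability guard around an offloadable region. */
    void compute_cpu(float *A, long n);   /* original CPU code path            */
    void compute_gpu(float *A, long n);   /* GPU code path with kernel launch  */

    void compute(float *A, long n) {
      if (n * n < 100000)                 /* invented threshold: too little compute */
        compute_cpu(A, n);
      else
        compute_gpu(A, n);
    }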

Slide 23
Kernels to Programs – Data Transfers

void heat(int n, float A[n], float hot, float cold) {
  float B[n] = {0};
  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

High-Performance

Slide 24
Data Transfer – Per Kernel

[Figure: timeline in which each kernel call (initialize(), setCenter(), average(), average(), ...) copies its data between host memory and device memory]

High-Performance

Slide 25
Data Transfer – Inter Kernel Caching
[POLLY_CACHE_ALLOCATIONS=1]

[Figure: same timeline with cached device allocations; data stays in device memory across the initialize(), setCenter(), and repeated average() kernels, avoiding redundant host-device transfers]

High-Performance
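A conceptual contrast between the two transfer strategies for the heat() example (a sketch of mine using the CUDA runtime API; the GPURuntime implementation may differ): with inter-kernel caching the arrays stay resident on the device across all average() launches, so the host-device round trip happens once instead of once per kernel:

    #include <cuda_runtime.h>

    __global__ void average_kernel(float *dst, const float *src, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i > 0 && i < n - 1)
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;   /* heat-style stencil */
    }

    void run_cached(float *A, int n, int T) {
      size_t bytes = n * sizeof(float);
      float *dA, *dB;
      cudaMalloc((void **)&dA, bytes);
      cudaMalloc((void **)&dB, bytes);
      cudaMemset(dB, 0, bytes);
      cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);       /* copy in once */
      for (int t = 0; t < T; t++) {
        average_kernel<<<(n + 255) / 256, 256>>>(dB, dA, n);
        average_kernel<<<(n + 255) / 256, 256>>>(dA, dB, n);
        /* without caching, A would be copied to and from the host every iteration */
      }
      cudaMemcpy(A, dA, bytes, cudaMemcpyDeviceToHost);       /* copy back once */
      cudaFree(dA);
      cudaFree(dB);
    }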

Slide 26
Evaluation

Workstation: 10-core SandyBridge, NVIDIA Titan Black (Kepler)

Slide 27
LLVM Nightly Test Suite

[Chart: # compute regions / kernels]

Slide 28
Polybench 3.2 Computational Kernels
Baseline: icc -O3 (sequential), 10-core CPU + NVIDIA Titan Black (workstation)

Slide 29
Lattice Boltzmann (SPEC 2006)

Slide 30
Cactus ADM (SPEC 2006) - Performance

Slide 31
Cactus ADM (SPEC 2006) - Data Transfer

Slide 32
Polly-ACC

Automatic
“Regression Free”
High Performance

http://spcl.inf.ethz.ch/Polly-ACC