Slide 1

Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler
LLVM Workshop @ CGO'17, February 4, 2017, Austin, TX

With contributions from:
Johannes Doerfert (University of Saarbruecken)
Michael Kruse, Albert Cohen, Sven Verdoolaege (Polly Labs)
Yabin Hu (China University of Geosciences)
... many others

Supported by: Swiss Universities / PASC, ARM, Qualcomm, Xilinx
Slide 2

row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
S0: *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
S1:     temp1 = *input_image_offset++;
S1:     temp2 = *kernel_offset++;
S1:     *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
S2: *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}

[Figure: sequential Fortran/C/C++ code like the above must be mapped onto parallel hardware: a multi-core CPU and a many-core GPU accelerator]

Sequential Software -> Parallel Hardware
Development | Maintenance | Performance Tuning
Slide 3

Design Goals

- Automatic ("Regression Free")
- High Performance
- Automatic Accelerator Mapping: how close can we get?

Non-Goal: Algorithmic Changes
Slide 4

Polly-ACC: Architecture

Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler, at the International Conference on Supercomputing (ICS), June 2016, Istanbul

GSoC '10: Hongbin Zheng (Xilinx)
GSoC '11: Justin Holewinski (OSU, today NVIDIA)
GSoC '12/'14: Yabin Hu
GSoC '16: Matthias Reisinger (TU Wien)
5How to get Polly-ACC
cd <
llvm-src>cd tools; git
clone llvm.org/polly.gitcd <llvm-build>; make clang –O3 –mllvm –polly –mllvm –polly-target=gpu -lGPURuntime –ldl
-o cpu-gpu
-hybrid-prog
file.c
./
cpu
-
gpu-hybrid-prog Slide6
Slide 6

Polyhedral Loop Modeling

Program Code:

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i, j);

Iteration Space (for N = 4):

D = { (i, j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

with the constraints 0 ≤ i, i ≤ N = 4, 0 ≤ j, and j ≤ i.

[Figure: the 15 statement instances S(i, j), from (0, 0) up to (4, 4), plotted as a triangle in the i/j plane]

Automatic

Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation
Tobias Grosser, Armin Groesslinger, and Christian Lengauer, in Parallel Processing Letters (PPL), April 2012
Slide 7

Mapping Computation to Device [-mllvm -polly-target=gpu]

[Figure: the triangular iteration space (i/j axes) is tiled; each tile becomes a device block, and the iterations inside a tile become that block's threads]

Automatic
Slide 8

Memory Hierarchy of a Heterogeneous System

Polyhedral Parallel Code Generation for CUDA
Sven Verdoolaege et al., in ACM Transactions on Architecture and Code Optimization, 2013

Slide 9

Host-Device Data Transfers

Automatic

Slide 10

Host-Device Data Transfers [-lGPURuntime -ldl]

Automatic
Slide 11

Mapping to Fast Memory [-mllvm -polly-acc-use-shared]

High-Performance

Slide 12

Mapping to Fast Memory [-mllvm -polly-acc-use-private]

High-Performance
Slide 13

Accessed Data (for a 2x2 thread block) [-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    ... = A[i+1][j] + A[i-1][j] + A[i][j]
        + A[i][j-1] + A[i][j+1];

Optimistic Delinearization of Parametrically Sized Arrays
Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015

Slide 16

Data needed on device: 12 elements. Minimal data, but complex transfer.

Slide 17

One-dimensional hull: 20 elements. Simple transfer, but redundant data.

Slide 18

Two-dimensional hull: 16 elements. Simple transfer, less redundant data.

Modeling multi-dimensional access behavior is important.

"Optimistic Loop Optimization" at CGO'17, with Johannes Doerfert and Sebastian Hack
Slide 19

Access Coalescing

[Figure: elements a through i laid out contiguously in DRAM]

High-Performance

Slide 20

Access Coalescing (no permute)

[Figure: a..i copied from DRAM into shared memory in original row order: a b c d e f g h i]

High-Performance

Slide 21

Access Coalescing (permuted)

[Figure: a..i staged through shared memory in permuted order (a d g b e h c f i), so consecutive threads read consecutive DRAM addresses]

High-Performance
Slide 22

Profitability Heuristic [-mllvm -polly-process-unprofitable=false]

[Figure: all loop nests are filtered before they reach the GPU: trivial nests, unsuitable nests, and nests with insufficient compute are rejected, using static modeling and dynamic execution checks]

Regression-Free
Slide 23

Kernels to Programs – Data Transfers

void heat(int n, float A[n], float hot, float cold) {
  float B[n] = {0};

  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

High-Performance
Slide 24

Data Transfer – Per Kernel

[Figure: timeline of host memory vs. device memory; each kernel (initialize, setCenter, average, average, average) is bracketed by its own host-to-device and device-to-host copies]

High-Performance

Slide 25

Data Transfer – Inter-Kernel Caching [POLLY_CACHE_ALLOCATIONS=1]

[Figure: the same timeline, but data stays resident in device memory across kernels; host copies happen only at the boundaries]

High-Performance
Slide 26

Evaluation

Workstation: 10-core SandyBridge CPU, NVIDIA Titan Black (Kepler) GPU
Slide 27

LLVM Nightly Test Suite

[Figure: number of compute regions / kernels per benchmark]

Slide 28

Polybench 3.2 Computational Kernels
Baseline: icc -O3 (sequential); 10-core CPU + NVIDIA Titan Black (workstation)
Slide 29

Lattice Boltzmann (SPEC 2006)

Slide 30

Cactus ADM (SPEC 2006) - Performance

Slide 31

Cactus ADM (SPEC 2006) - Data Transfer
Slide 32

Polly-ACC
- Automatic ("Regression Free")
- High Performance

http://spcl.inf.ethz.ch/Polly-ACC