Slide 1

Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler
LLVM Workshop @ CGO'17, February 4, 2017, Austin, TX

With contributions from:
Johannes Doerfert (University of Saarbruecken)
Michael Kruse, Albert Cohen, Sven Verdoolaege (Polly Labs)
Yabin Hu (China University of Geosciences)
... many others

Supported by: Swiss Universities / PASC, ARM, Qualcomm, Xilinx
Slide 2

row = 0;
output_image_ptr = output_image;
output_image_ptr += (NN * dead_rows);
for (r = 0; r < NN - KK + 1; r++) {
  output_image_offset = output_image_ptr;
  output_image_offset += dead_cols;
  col = 0;
  for (c = 0; c < NN - KK + 1; c++) {
    input_image_ptr = input_image;
    input_image_ptr += (NN * row);
    kernel_ptr = kernel;
S0: *output_image_offset = 0;
    for (i = 0; i < KK; i++) {
      input_image_offset = input_image_ptr;
      input_image_offset += col;
      kernel_offset = kernel_ptr;
      for (j = 0; j < KK; j++) {
S1:     temp1 = *input_image_offset++;
S1:     temp2 = *kernel_offset++;
S1:     *output_image_offset += temp1 * temp2;
      }
      kernel_ptr += KK;
      input_image_ptr += NN;
    }
S2: *output_image_offset = ((*output_image_offset) / normal_factor);
    output_image_offset++;
    col++;
  }
  output_image_ptr += NN;
  row++;
}

[Figure: sequential Fortran/C/C++ code like the above must be mapped onto parallel hardware: a multi-core CPU and a many-core GPU accelerator]

Sequential Software -> Parallel Hardware
Development | Maintenance | Performance Tuning
Slide 3

Design Goals

- Automatic ("Regression Free")
- High Performance
- Automatic Accelerator Mapping: how close can we get?

Non-Goal: Algorithmic Changes
Slide 4

Polly-ACC: Architecture

Polly-ACC: Transparent Compilation to Heterogeneous Hardware
Tobias Grosser, Torsten Hoefler, at the International Conference on Supercomputing (ICS), June 2016, Istanbul

GSoC '10: Hongbin Zheng (Xilinx)
GSoC '11: Justin Holewinski (OSU, today NVIDIA)
GSoC '12/'14: Yabin Hu
GSoC '16: Matthias Reisinger (TU Wien)
5How to get Polly-ACC
cd <
llvm-src>cd tools; git
clone llvm.org/polly.gitcd <llvm-build>; make clang –O3 –mllvm –polly –mllvm –polly-target=gpu -lGPURuntime –ldl
-o cpu-gpu
-hybrid-prog
file.c
./
cpu
-
gpu-hybrid-prog Slide6
Slide 6

Polyhedral Loop Modeling

Program Code:

for (i = 0; i <= N; i++)
  for (j = 0; j <= i; j++)
    S(i, j);

Iteration Space (for N = 4):

D = { (i, j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

with the constraints 0 ≤ i, i ≤ N = 4, 0 ≤ j, and j ≤ i.

[Figure: the 15 statement instances S(i, j), from (0, 0) up to (4, 4), plotted as a triangle in the i/j plane]

Automatic

Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation
Tobias Grosser, Armin Groesslinger, and Christian Lengauer, in Parallel Processing Letters (PPL), April 2012
Slide 7

Mapping Computation to Device [-mllvm -polly-target=gpu]

[Figure: the triangular iteration space (i/j axes) is tiled; each tile becomes a device block, and the iterations inside a tile become that block's threads]

Automatic
Slide 8

Memory Hierarchy of a Heterogeneous System

Polyhedral Parallel Code Generation for CUDA
Sven Verdoolaege et al., in ACM Transactions on Architecture and Code Optimization, 2013

Slide 9

Host-Device Data Transfers

Automatic

Slide 10

Host-Device Data Transfers [-lGPURuntime -ldl]

Automatic
Slide 11

Mapping to Fast Memory [-mllvm -polly-acc-use-shared]

High-Performance

Slide 12

Mapping to Fast Memory [-mllvm -polly-acc-use-private]

High-Performance
Slide 13

Accessed Data (for a 2x2 thread block) [-delinearize]

for (i = 1; i <= 6; i++)
  for (j = 1; j <= 4; j++)
    ... = A[i+1][j] + A[i-1][j] + A[i][j]
        + A[i][j-1] + A[i][j+1];

Optimistic Delinearization of Parametrically Sized Arrays
Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop, at the International Conference on Supercomputing (ICS), 2015

Slide 16

Data needed on device: 12 elements. Minimal data, but complex transfer.

Slide 17

One-dimensional hull: 20 elements. Simple transfer, but redundant data.

Slide 18

Two-dimensional hull: 16 elements. Simple transfer, less redundant data.

Modeling multi-dimensional access behavior is important.

"Optimistic Loop Optimization" at CGO'17, with Johannes Doerfert and Sebastian Hack
Slide 19

Access Coalescing

[Figure: elements a through i laid out contiguously in DRAM]

High-Performance

Slide 20

Access Coalescing (no permute)

[Figure: a..i copied from DRAM into shared memory in original row order: a b c d e f g h i]

High-Performance

Slide 21

Access Coalescing (permuted)

[Figure: a..i staged through shared memory in permuted order (a d g b e h c f i), so consecutive threads read consecutive DRAM addresses]

High-Performance
Slide 22

Profitability Heuristic [-mllvm -polly-process-unprofitable=false]

[Figure: all loop nests are filtered before they reach the GPU: trivial nests, unsuitable nests, and nests with insufficient compute are rejected, using static modeling and dynamic execution checks]

Regression-Free
Slide 23

Kernels to Programs – Data Transfers

void heat(int n, float A[n], float hot, float cold) {
  float B[n] = {0};

  initialize(n, A, cold);
  setCenter(n, A, hot, n/4);
  for (int t = 0; t < T; t++) {
    average(n, A, B);
    average(n, B, A);
    printf("Iteration %d done", t);
  }
}

High-Performance
Slide 24

Data Transfer – Per Kernel

[Figure: timeline of host memory vs. device memory; each kernel (initialize, setCenter, average, average, average) is bracketed by its own host-to-device and device-to-host copies]

High-Performance

Slide 25

Data Transfer – Inter-Kernel Caching [POLLY_CACHE_ALLOCATIONS=1]

[Figure: the same timeline, but data stays resident in device memory across kernels; host copies happen only at the boundaries]

High-Performance
Slide 26

Evaluation

Workstation: 10-core SandyBridge CPU, NVIDIA Titan Black (Kepler) GPU
Slide 27

LLVM Nightly Test Suite

[Figure: number of compute regions / kernels per benchmark]

Slide 28

Polybench 3.2 Computational Kernels
Baseline: icc -O3 (sequential); 10-core CPU + NVIDIA Titan Black (workstation)
Slide 29

Lattice Boltzmann (SPEC 2006)

Slide 30

Cactus ADM (SPEC 2006) - Performance

Slide 31

Cactus ADM (SPEC 2006) - Data Transfer
Slide 32

Polly-ACC
- Automatic ("Regression Free")
- High Performance

http://spcl.inf.ethz.ch/Polly-ACC