Slide1
Collaborative Computing for Heterogeneous Integrated Systems
Li-Wen Chang (a), Juan Gómez-Luna (b), Izzat El Hajj (a), Sitao Huang (a), Deming Chen (a), and Wen-Mei W. Hwu (a)
(a) University of Illinois at Urbana-Champaign
(b) Universidad de Córdoba
8th ACM/SPEC International Conference on Performance Engineering
Slide2
Collaborative Computing
Performance and energy efficiency
Traditionally, CPUs + Accelerator (GPU, FPGA…)
Recent programming frameworks
  CUDA 8.0, OpenCL 2.0
New heterogeneous features
  Shared virtual memory, coherence, and system-wide atomics
Tighter integration allows fine-grain collaboration
We envision integrated systems including CPUs, GPUs and/or FPGAs
New high-level programming languages will synthesize kernels with different collaborative patterns
Juan Gómez Luna. 8th ACM/SPEC International Conference on Performance Engineering. L’Aquila, April 26, 2017
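As an illustration of how shared memory and a system-wide atomic can enable fine-grain collaboration, the following is a minimal sketch, not code from the talk. Two host threads stand in for two devices, and `std::atomic` stands in for a system-wide atomic; the function name `collaborative_square` and the chunk-claiming scheme are assumptions made for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: two host threads stand in for two devices that share
// memory and one atomic work counter (the analogue of a system-wide atomic).
// Each worker claims the next chunk of indices with fetch_add, so a faster
// worker simply ends up processing more chunks.
inline std::vector<int> collaborative_square(const std::vector<int>& in,
                                             std::size_t chunk) {
    assert(chunk > 0);
    std::vector<int> out(in.size());
    std::atomic<std::size_t> next{0};  // shared work counter

    auto worker = [&]() {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk);
            if (begin >= in.size()) break;  // no work left
            const std::size_t end = std::min(begin + chunk, in.size());
            for (std::size_t i = begin; i < end; ++i) out[i] = in[i] * in[i];
        }
    };

    std::thread device1(worker);
    std::thread device2(worker);
    device1.join();
    device2.join();
    return out;
}
```

Calling `collaborative_square(input, 64)` squares every element of `input`; which worker handles which chunk is decided at run time, and the two workers never write the same output element.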
Slide3
Integrated Heterogeneous Systems
Our vision of an integrated heterogeneous system
[Figure: block diagram. Components shown: CPU cores 0 to N-1 with L1 caches and an L2; GPU with CUs, L1 caches, and an L2; FPGA with DMA and scratchpad; coherent interconnect; LLC; DRAM controller; DRAM. Legend: crossbar, coherent bus, non-coherent bus.]
Slide4
Collaborative Patterns
[Figure: Program Structure (data-parallel tasks, sequential sub-tasks, coarse-grained synchronization) and Data Partitioning across Device 1 and Device 2.]
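A minimal sketch of data partitioning, under stated assumptions: two host threads stand in for Device 1 and Device 2, and the split fraction `alpha` and the function name `scale_partitioned` are hypothetical choices for this example. Both workers run the same data-parallel task, each on its own share of the data.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: multiply every element of `data` by `factor`.
// Device 1 processes the first `alpha` fraction of the elements and
// device 2 processes the remainder; both run the same task.
inline void scale_partitioned(std::vector<float>& data, float factor,
                              double alpha) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    const std::size_t split = static_cast<std::size_t>(alpha * data.size());

    auto scale_range = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] *= factor;
    };

    // Host threads stand in for the two devices.
    std::thread device1([&] { scale_range(0, split); });
    std::thread device2([&] { scale_range(split, data.size()); });
    device1.join();
    device2.join();
}
```

With `alpha = 0.5` the two workers each take half of the elements; setting `alpha` to 0 or 1 gives all of the work to a single worker.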
Slide5
Collaborative Patterns
[Figure: Program Structure (data-parallel tasks, sequential sub-tasks, coarse-grained synchronization) and Fine-grained Task Partitioning across Device 1 and Device 2.]
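A minimal sketch of fine-grained task partitioning, assuming that each task consists of a sequential sub-task followed by a data-parallel sub-task. Host threads stand in for the devices, and the per-task atomic flags, the function name `fine_grained_tasks`, and the toy workload (choose a threshold, then count the elements above it) are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example. For every task t:
//   device 1 runs the sequential sub-task (choose a threshold), then
//   device 2 runs the data-parallel sub-task (count elements above it).
// A per-task atomic flag hands each task over from device 1 to device 2.
inline std::vector<int> fine_grained_tasks(const std::vector<int>& data,
                                           int num_tasks) {
    assert(num_tasks >= 0);
    const std::size_t n = static_cast<std::size_t>(num_tasks);
    std::vector<int> threshold(n, 0);
    std::vector<int> count(n, 0);
    std::vector<std::atomic<int>> ready(n);
    for (auto& flag : ready) flag.store(0);

    std::thread device1([&] {
        for (std::size_t t = 0; t < n; ++t) {
            threshold[t] = static_cast<int>(t);  // stand-in sequential sub-task
            ready[t].store(1, std::memory_order_release);  // publish task t
        }
    });
    std::thread device2([&] {
        for (std::size_t t = 0; t < n; ++t) {
            // Wait until device 1 has finished the sequential part of task t.
            while (ready[t].load(std::memory_order_acquire) == 0)
                std::this_thread::yield();
            for (int v : data)
                if (v > threshold[t]) ++count[t];
        }
    });
    device1.join();
    device2.join();
    return count;
}
```

Because the hand-off happens per task, device 1 can already be working on task t+1 while device 2 is still processing task t.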
Slide6
Collaborative Patterns
[Figure: Program Structure (data-parallel tasks, sequential sub-tasks, coarse-grained synchronization) and Coarse-grained Task Partitioning across Device 1 and Device 2.]
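A minimal sketch of coarse-grained task partitioning, assuming two whole tasks separated by a coarse-grained synchronization point. Host threads stand in for the devices, a thread join stands in for the synchronization, and the function name `coarse_grained_tasks` and the toy tasks (double, then threshold) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example: each device executes one whole task.
//   Task 1 (device 1): double every input element.
//   Task 2 (device 2): mark the elements whose doubled value exceeds
//                      `threshold`.
inline std::vector<int> coarse_grained_tasks(const std::vector<int>& in,
                                             int threshold) {
    assert(threshold >= 0);  // this sketch assumes non-negative thresholds
    std::vector<int> stage(in.size());
    std::vector<int> out(in.size());

    std::thread device1([&] {
        for (std::size_t i = 0; i < in.size(); ++i) stage[i] = 2 * in[i];
    });
    device1.join();  // coarse-grained synchronization between the two tasks

    std::thread device2([&] {
        for (std::size_t i = 0; i < stage.size(); ++i)
            out[i] = stage[i] > threshold ? 1 : 0;
    });
    device2.join();
    return out;
}
```

In this sketch the unit of distribution is an entire task rather than a slice of the data or a sub-task inside a task.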
Slide7
Chai Benchmark Suite
These collaboration patterns can be found in the Chai benchmark suite
  8 data partitioning benchmarks
  3 fine-grained task partitioning benchmarks
  3 coarse-grained task partitioning benchmarks
https://chai-benchmarks.github.io
J. Gómez-Luna, I. El Hajj, L.-W. Chang, V. Garcia-Flores, S. Garcia de Gonzalo, T. Jablin, A. J. Peña, W.-M. Hwu. Chai: Collaborative Heterogeneous Applications for Integrated-architectures. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017
Slide8
Preliminary Evaluation
Canny Edge Detection (CED) and RANSAC (RSC)
  CEDD (data partitioning) and CEDT (coarse-grained task partitioning)
  RSCD (data partitioning) and RSCT (fine-grained task partitioning)
[Figure panels: CPU+GPU (AMD Kaveri); CPU+FPGA (Xeon+StratixV)]
Slide9
Programming Interface
Conversion between collaboration strategies is a code transformation
The best strategy across systems is a performance portability problem
Limitation of current practice
  OpenCL requires programmers to express the collaboration strategy explicitly
  Lack of flexibility
Conversion between strategies is challenging
  To fine-grained task partitioning: partitioning (fission) point, kernel fission
  From fine-grained task partitioning: sophisticated loop transform, kernel fusion
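To make kernel fission concrete, here is a minimal sketch on a toy computation. The function names and the computation itself (square, then add one) are hypothetical. The fission point lies between the two steps of the fused kernel, and splitting there introduces an intermediate buffer that the second kernel reads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused version: both steps run inside one kernel.
inline void fused_kernel(const std::vector<int>& in, std::vector<int>& out) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const int squared = in[i] * in[i];  // step 1
        out[i] = squared + 1;               // step 2
    }
}

// Fissioned version: the kernel is split at the point between step 1 and
// step 2, so the intermediate value must now live in a buffer (`tmp`).
inline void fission_kernel_a(const std::vector<int>& in,
                             std::vector<int>& tmp) {
    assert(tmp.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = in[i] * in[i];
}

inline void fission_kernel_b(const std::vector<int>& tmp,
                             std::vector<int>& out) {
    assert(out.size() == tmp.size());
    for (std::size_t i = 0; i < tmp.size(); ++i) out[i] = tmp[i] + 1;
}
```

Running `fission_kernel_a` followed by `fission_kernel_b` produces the same output as `fused_kernel`; once split, the two kernels could be assigned to different devices, and kernel fusion is the reverse transformation.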
Slide10
Programming Interface
High-level programming languages
  E.g. TANGRAM provides performance portability
Collaboration strategies as optimizations in kernel synthesis
  TANGRAM’s atomic codelets are fine-grain sub-tasks
    No need to identify fission points
  Coarse-grain tasks synthesized as kernels
  Extension of TANGRAM’s map and partition for data partitioning
For FPGAs, HLS will be introduced as a backend
Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna, Wen-mei Hwu. Efficient Kernel Synthesis for Performance Portable Programming. In Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016
Slide11
Collaborative Computing for Heterogeneous Integrated Systems
Thanks!
8th ACM/SPEC International Conference on Performance Engineering
https://chai-benchmarks.github.io