Slide1
KFusion
Simple Annotations for Optimized Data Flow
Liam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady
University of Victoria
with thanks to Tim Mattson and Andrew Brownsword (Intel)
Slide2
Road Map
- KFusion at work
  - Motivation
  - KFusion
- Costs and benefits
  - annotations, lines of code
  - modularity, performance
- Future work and conclusion
  - explicit composition of computation around data flow
IWOCL 2013 Kiemele
Slide3
Parallel Hardware
Slide4
Good News and Bad News…
- Parallelism
  - Added complexity
- Optimization
  - Memory and Bandwidth
- Modularity: Let’s talk Libraries
  - Details behind an API
  - Optimize data access (prefetching, caching…)
  - Better separation of concerns
Slide5
OpenCL Libraries
- OpenCL (Open Computing Language), for CPUs and GPUs
- At the heart of any given library will be kernels
- Suppose we build an OpenCL Linear Algebra Library

    __kernel void add_vectors(__global float* sum,
                              __global float* v1,
                              __global float* v2) {
        /* each work-item handles one element */
        int i = get_global_id(0);
        sum[i] = v1[i] + v2[i];
    }
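For readers without an OpenCL runtime at hand, the kernel above can be mimicked in plain C. This is only a minimal sketch, not the OpenCL execution model: the names `add_vectors_item` and `run_add_vectors` are hypothetical, and a sequential loop stands in for the work-items that `get_global_id(0)` would identify.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for one work-item; `i` plays the role of get_global_id(0). */
void add_vectors_item(float *sum, const float *v1, const float *v2, size_t i) {
    sum[i] = v1[i] + v2[i];
}

/* Hypothetical stand-in for launching the kernel over n work-items.
   A real OpenCL device would run these iterations in parallel. */
void run_add_vectors(float *sum, const float *v1, const float *v2, size_t n) {
    assert(sum != NULL && v1 != NULL && v2 != NULL);
    for (size_t i = 0; i < n; ++i) {
        add_vectors_item(sum, v1, v2, i);
    }
}
```

The point of the split is that the kernel body never loops itself: the index comes from outside, which is what lets the runtime spread elements across CPU or GPU cores.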
Slide6
What you get…
c = sqrt(add(square(x), square(y)));
[Diagram: separate kernels square, square, add, sqrt]
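The cost of composing the expression above from separate library calls can be made concrete with a small C emulation. This is a sketch under stated assumptions: each function stands in for one library kernel, the `Traffic` counter and the names `root`, `approx_sqrt`, and `hypot_modular` are hypothetical, and `x` and `y` are squared in place.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library.
   Adequate for the moderate values used here, not for very small inputs. */
float approx_sqrt(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;      /* start at or above the root */
    for (int k = 0; k < 40; ++k) r = 0.5f * (r + v / r);
    return r;
}

/* Running count of memory accesses, one unit per element moved. */
typedef struct { unsigned long loads, stores; } Traffic;

void square(float *out, const float *in, size_t n, Traffic *t) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i] * in[i];
    t->loads += n; t->stores += n;
}

void add(float *out, const float *a, const float *b, size_t n, Traffic *t) {
    for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    t->loads += 2 * n; t->stores += n;
}

void root(float *out, const float *in, size_t n, Traffic *t) {
    for (size_t i = 0; i < n; ++i) out[i] = approx_sqrt(in[i]);
    t->loads += n; t->stores += n;
}

/* c = sqrt(add(square(x), square(y))) as four separate passes over memory. */
void hypot_modular(float *c, float *x, float *y, size_t n, Traffic *t) {
    assert(c && x && y && t);
    square(x, x, n, t);
    square(y, y, n, t);
    add(c, x, y, n, t);
    root(c, c, n, t);
}
```

Each element is touched nine times (five loads, four stores) even though only four arithmetic operations are performed on it, because every intermediate result makes a round trip through memory.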
Slide7
What you get…
c = sqrt(add(square(x), square(y)));

| Kernel | Operation | Memory Access | Cycles |
|--------|-----------|---------------|--------|
| square | 1 | load and store | 804 |
| square | 1 | load and store | 804 |
| add | 1 | 2 loads and 1 store | 804 |
| sqrt | 1 | load and store | 804 |
| total | 4 | 9 | 3216 |

Slide8
What you WANT!
c = sqrt(add(square(x), square(y)));
[Diagram: x, y, add, sqrt]
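The fused form of the expression above can be sketched in C as a single pass. This is an illustration, not KFusion's generated code: `hypot_fused`, `approx_sqrt`, and the `Traffic` counter are hypothetical names, and the loop again stands in for parallel work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library.
   Adequate for the moderate values used here, not for very small inputs. */
float approx_sqrt(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;      /* start at or above the root */
    for (int k = 0; k < 40; ++k) r = 0.5f * (r + v / r);
    return r;
}

/* Running count of memory accesses, one unit per element moved. */
typedef struct { unsigned long loads, stores; } Traffic;

/* c = sqrt(add(square(x), square(y))) in one pass: intermediates stay in
   local variables, so only the inputs and the final result touch memory. */
void hypot_fused(float *c, const float *x, const float *y, size_t n, Traffic *t) {
    assert(c && x && y && t);
    for (size_t i = 0; i < n; ++i) {
        float xi = x[i];                 /* load */
        float yi = y[i];                 /* load */
        float s = xi * xi + yi * yi;     /* square, square, add: registers only */
        c[i] = approx_sqrt(s);           /* sqrt, then the single store */
    }
    t->loads += 2 * n;
    t->stores += n;
}
```

The same four operations are performed per element, but memory is touched three times instead of nine, and the inputs are left unmodified.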
Slide9
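What you WANT!
c = sqrt(add(square(x), square(y)));

| Kernel | Operation | Memory Access | Cycles |
|--------|-----------|---------------|--------|
| square | 1 | load and store | 804 |
| square | 1 | load and store | 804 |
| add | 1 | 2 loads and 1 store | 804 |
| sqrt | 1 | load and store | 804 |
| total | 4 | 9 | 3216 |

| Kernel | Operation | Memory Access | Cycles |
|--------|-----------|---------------|--------|
| fu | 1 | load | 404 |
| | 1 | load | 404 |
| | 1 | - | 4 |
| | 1 | store | 404 |
| total | 4 | 3 | 1216 |

Slide10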
Two Choices
Modular Implementation
- Reusable
- Easy to maintain and develop
- Individual kernel optimization
Monolithic Implementation
- Performance
- Allows optimizations across what would otherwise be module boundaries
Can we do both?
Slide11
Introducing KFusion
Application File: square(…), square(…), add(…), sqrt(…)
Library File: float* square, float* add …, float* sqrt …
Kernel File: kernel square, kernel add …, kernel sqrt …

| Kernel | Operation | Memory Access | Cycles |
|--------|-----------|---------------|--------|
| square | 1 | load and store | 804 |
| square | 1 | load and store | 804 |
| add | 1 | 2 loads and 1 store | 804 |
| sqrt | 1 | load and store | 804 |
| total | 4 | 9 | 3216 |

Slide12
After KFusion…
Application File: square(…), square(…), add(…), sqrt(…)
Library File: void square …, void add …, void sqrt …
Kernel File: kernel square, kernel add …, kernel sqrt …
New Call: c = fu(…);
New Function: float* fu(…)
New Kernel: kernel fu(…)

| Kernel | Operation | Memory Access | Cycles |
|--------|-----------|---------------|--------|
| fu | 1 | load | 404 |
| | 1 | load | 404 |
| | 1 | - | 4 |
| | 1 | store | 404 |
| total | 4 | 3 | 1216 |

Slide13
It works!
Slide14
Road Map
- KFusion at work
  - what and how
  - …why!
- Costs and benefits
  - annotations, lines of code
  - modularity, performance
- Future work and conclusion
  - explicit composition of computation around data flow
Slide15
Costs
- Annotations
  - application hints
  - library synchronization
  - kernel data flow for compositions
- Preprocessor
  - build dependency graph
  - source-to-source transformation
    - loop fusion
    - deforestation
Slide16
Annotations
Application:

    #pragma start fuse
    square(x, x);
    square(y, y);
    add(c, x, y);
    sqrt(c, c);
    /* the same computation as a nested expression */
    c = sqrt(add(square(x), square(y)));
    #pragma end fuse

Library:

    #pragma sync out
    public void dot_product(double result, vector x);
    #pragma sync in
    public void matrix_vector_mult(vector b, Matrix A, vector x);

Slide17
Annotations

    __kernel void add_vectors(__global float* sum, __global float* v1, __global float* v2) {
        #pragma kload
        {
            int i = get_global_id(0);
            float arg1 = v1[i];
            float arg2 = v2[i];
            float s;
        }
        s = arg1 + arg2;
        #pragma kstore
        {
            sum[i] = s;
        }
    }
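One way to read the `kload` and `kstore` annotations above is that they split a kernel into a load section, a computation, and a store section, so that sections from different kernels can be recombined. The C sketch below illustrates that reading by hand; it is an assumption about the idea, not KFusion's actual transformation. The `Regs` struct, the `sqrt_*` helpers, `approx_sqrt`, and `add_sqrt_fused` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library.
   Adequate for the moderate values used here, not for very small inputs. */
float approx_sqrt(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;      /* start at or above the root */
    for (int k = 0; k < 40; ++k) r = 0.5f * (r + v / r);
    return r;
}

/* Per-work-item values that stay in registers between sections. */
typedef struct { float arg1, arg2, s; } Regs;

/* The annotated add kernel, split at its kload section and its computation. */
void add_kload(Regs *r, const float *v1, const float *v2, size_t i) {
    r->arg1 = v1[i];
    r->arg2 = v2[i];
}
void add_compute(Regs *r) { r->s = r->arg1 + r->arg2; }

/* Hypothetical sqrt kernel in the same shape; only the parts used below. */
void sqrt_compute(Regs *r) { r->s = approx_sqrt(r->s); }
void sqrt_kstore(const Regs *r, float *out, size_t i) { out[i] = r->s; }

/* add followed by sqrt, composed by hand: add's store and sqrt's load are
   dropped because the intermediate value never leaves the Regs struct. */
void add_sqrt_fused(float *out, const float *v1, const float *v2, size_t n) {
    assert(out != NULL && v1 != NULL && v2 != NULL);
    for (size_t i = 0; i < n; ++i) {
        Regs r;
        add_kload(&r, v1, v2, i);
        add_compute(&r);
        sqrt_compute(&r);
        sqrt_kstore(&r, out, i);
    }
}
```

Marking the load and store sections explicitly is what makes this recombination mechanical: a tool only needs to know which sections to keep, not how to analyze the arithmetic in between.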
Slide18
Dependency Graph
[Diagram: nodes square(x), square(y), add(c,x,y), sqrt(c); edges labeled x, y, c]
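A dependency graph like the one on this slide can be derived from the order of calls and the buffers they touch. The following C sketch shows one simple way to do that; it is not taken from the KFusion preprocessor. It assumes the first argument of each call is the buffer written and the rest are read, and the `Call` struct, `MAX_CALLS`, `MAX_INPUTS`, and `build_dependencies` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CALLS 8
#define MAX_INPUTS 2

/* One call inside a fuse region: the buffer it writes and the buffers it reads. */
typedef struct {
    const char *name;
    const char *out;
    const char *in[MAX_INPUTS];
    int n_in;
} Call;

/* Sets dep[j][i] to 1 when call j reads a buffer whose most recent
   earlier writer is call i, and to 0 otherwise. */
void build_dependencies(const Call *calls, int n, int dep[MAX_CALLS][MAX_CALLS]) {
    assert(calls != NULL && n >= 0 && n <= MAX_CALLS);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) dep[j][i] = 0;
        for (int a = 0; a < calls[j].n_in; ++a) {
            /* Walk backwards to the latest earlier call that wrote this buffer. */
            for (int i = j - 1; i >= 0; --i) {
                if (strcmp(calls[i].out, calls[j].in[a]) == 0) {
                    dep[j][i] = 1;
                    break;
                }
            }
        }
    }
}
```

For the four calls on this slide, the result is that add depends on both square calls (through x and y) and sqrt depends on add (through c), which matches the edges in the diagram.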
Slide19
Transformation…
[Diagram: nodes square(x), square(y), add_sqrt(c,x,y); edges labeled x, y, c]
Slide20
Replacement Kernel!
[Diagram: single node fu(c,x,y); edges labeled x, y, c]
Slide21
Annotations
Slide22
Benefits
Slide23
Performance
Slide24
Performance
Slide25
Roofline Analysis of Performance
Peak Actual GFlops = minimum(Bandwidth x flops/byte, Peak Performance)
Three Linear Algebra Scenarios
- c = sqrt(a^2 + b^2)
- d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
- Start of conjugate gradient
  - r = Ax - b
  - p = r
  - R^2 = r*r
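The roofline bound on this slide can be written as a small C function. This is a sketch for illustration: the function names are hypothetical, each operation is treated as one flop, and elements are assumed to be 4-byte floats.

```c
#include <assert.h>

/* Attainable GFlops under the roofline model: the smaller of the
   bandwidth-limited rate and the compute peak. */
double roofline_gflops(double bandwidth_gb_per_s, double flops_per_byte,
                       double peak_gflops) {
    assert(bandwidth_gb_per_s >= 0.0 && flops_per_byte >= 0.0 && peak_gflops >= 0.0);
    double memory_bound = bandwidth_gb_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Flops per byte for an element-wise pipeline on 4-byte floats, given the
   per-element operation and memory-access counts. */
double intensity(int operations, int memory_accesses) {
    assert(operations >= 0 && memory_accesses > 0);
    return (double)operations / (4.0 * (double)memory_accesses);
}
```

Using the per-element counts from the earlier cost tables, the unfused pipeline performs 4 operations over 9 accesses (4/36, about 0.11 flops/byte) while the fused kernel performs 4 operations over 3 accesses (4/12, about 0.33 flops/byte). On any device where both fall below the compute peak, the model predicts roughly a threefold higher bound for the fused kernel, whatever the actual bandwidth figure is.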
Slide26
c = sqrt(a^2 + b^2)
Slide27
d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
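For this scenario, a fused version might look like the C sketch below. It is an illustration only: `distance_fused` and `approx_sqrt` are hypothetical names, and the sequential loop stands in for parallel work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library.
   Adequate for the moderate values used here, not for very small inputs. */
float approx_sqrt(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;      /* start at or above the root */
    for (int k = 0; k < 40; ++k) r = 0.5f * (r + v / r);
    return r;
}

/* d = sqrt((x1 - x2)^2 + (y1 - y2)^2) in one pass:
   four loads and one store per element. */
void distance_fused(float *d, const float *x1, const float *x2,
                    const float *y1, const float *y2, size_t n) {
    assert(d && x1 && x2 && y1 && y2);
    for (size_t i = 0; i < n; ++i) {
        float dx = x1[i] - x2[i];
        float dy = y1[i] - y2[i];
        d[i] = approx_sqrt(dx * dx + dy * dy);
    }
}
```

Built from separate element-wise calls (two subtractions, two squares, an add, and a sqrt, assuming a subtraction kernel exists alongside the others), the same result would take 15 memory accesses per element instead of 5.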
Slide28
Conjugate Gradient
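The three steps that start conjugate gradient (r = Ax - b, p = r, R^2 = r*r) can be combined into one sweep, sketched here in sequential C. This is an illustration under assumptions, not the benchmarked implementation: `cg_start_fused` is a hypothetical name, A is taken to be a dense n-by-n matrix in row-major order, and a device version would need a parallel reduction for the dot product.

```c
#include <assert.h>
#include <stddef.h>

/* Computes r = A*x - b, copies it into p, and returns r.r, visiting each
   row once so r is never re-read from memory for the later steps. */
float cg_start_fused(float *r, float *p, const float *A, const float *x,
                     const float *b, size_t n) {
    assert(r && p && A && x && b);
    float rr = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j) {
            acc += A[i * n + j] * x[j];   /* row i of A times x */
        }
        float ri = acc - b[i];
        r[i] = ri;
        p[i] = ri;
        rr += ri * ri;                    /* running dot product r.r */
    }
    return rr;
}
```

Written as three separate library calls, the residual would be stored, loaded again for the copy, and loaded a third time for the dot product.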
Slide29
Road Map
- KFusion at work
  - what and how
  - …why!
- Costs and benefits
  - annotations, lines of code
  - modularity, performance
- Future work and conclusion
  - explicit composition of computation around data flow
Slide30
Future Work
- Tools
  - comprehension and visualization
  - emulation
  - performance testing
- Combine with other approaches
  - Optimizing compilers
  - Code generators

    kfuse {
        calls
    }

    __kernel void k(…) {
        kload { … }
        computation
        kstore { … }
    }

Slide31
Conclusion
- KFusion is a first step towards
  - explicit, flexible control
  - allowing optimizations between modules
  - separation of concerns
- github.com/4Liamk/KFusion/wiki