
K F U S I O N - PowerPoint Presentation

Uploaded On 2016-07-07


Simple Annotations for Optimized Data Flow. Liam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady, University of Victoria, with thanks to Tim Mattson, Andrew Brownsword (Intel).



Presentation Transcript

Slide1

K F U S I O N: Simple Annotations for Optimized Data Flow

Liam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady
University of Victoria

with thanks to Tim Mattson, Andrew Brownsword (Intel)

Slide2

Road Map

KFusion at work
  Motivation
  KFusion
Costs and benefits
  annotations, lines of code
  modularity, performance
Future work and conclusion
  explicit composition of computation around data flow

IWOCL 2013 Kiemele

Slide3

Parallel Hardware

Slide4

Good News and Bad News…

Parallelism
  Added complexity
Optimization
  Memory and bandwidth
Modularity: let's talk libraries
  Details behind an API
  Optimize data access (prefetching, caching…)
  Better separation of concerns

Slide5

OpenCL Libraries

OpenCL (Open Computing Language), for CPUs and GPUs
At the heart of any given library will be kernels
Suppose we build an OpenCL linear algebra library

__kernel void add_vectors(__global float* sum,
                          __global float* v1,
                          __global float* v2) {
    int i = get_global_id(0);   /* index of this work-item */
    sum[i] = v1[i] + v2[i];     /* one element per work-item */
}

Slide6

What you get…

c = sqrt(add(square(x), square(y)));

[Figure: four separate kernels: square, square, add, sqrt]
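As a rough illustration of why this form is costly, the following plain C sketch runs the expression above as four separate passes over the data. It is not KFusion or OpenCL code: the function names, the in-place calling convention, the loads/stores counters and the Newton-iteration sqrt_newton helper (a stand-in for sqrtf, adequate for small demo inputs) are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical counters, used only to make memory traffic visible. */
static size_t loads = 0, stores = 0;

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library. */
static float sqrt_newton(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;
    for (int k = 0; k < 40; k++) r = 0.5f * (r + v / r);
    return r;
}

/* Each function below stands in for one library kernel run over n elements. */
static void square(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = in[i];  loads++;
        out[i] = v * v;   stores++;
    }
}

static void add(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float u = a[i];   loads++;
        float v = b[i];   loads++;
        out[i] = u + v;   stores++;
    }
}

static void root(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = in[i];          loads++;
        out[i] = sqrt_newton(v);  stores++;
    }
}

/* c = sqrt(add(square(x), square(y))) as four passes.
   For brevity x and y are overwritten with their squares. */
void hypot_unfused(float *c, float *x, float *y, size_t n) {
    square(x, x, n);
    square(y, y, n);
    add(c, x, y, n);
    root(c, c, n);
}
```

Per element this makes 5 loads and 4 stores, that is 9 memory accesses, because every intermediate result passes through memory between kernels.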

Slide7

What you get…

c = sqrt(add(square(x), square(y)));

Kernel   Operation   Memory Access         Cycles
square   1           load and store        804
square   1           load and store        804
add      1           2 loads and 1 store   804
sqrt     1           load and store        804
total    4           9                     3216

Slide8

What you WANT!

c = sqrt(add(square(x), square(y)));

[Figure: labels x, y, add, sqrt]
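A hand-fused version of the same expression might look like the sketch below. It is plain C rather than OpenCL, and the function name, the loads/stores counters and the Newton-iteration sqrt_newton helper (a stand-in for sqrtf) are hypothetical, included only for illustration. The intermediate values stay in local variables, so each element costs two loads and one store.

```c
#include <stddef.h>

/* Hypothetical counters, used only to make memory traffic visible. */
static size_t loads = 0, stores = 0;

/* Stand-in for sqrtf (Newton's method) so the sketch needs no math library. */
static float sqrt_newton(float v) {
    if (v <= 0.0f) return 0.0f;
    float r = v > 1.0f ? v : 1.0f;
    for (int k = 0; k < 40; k++) r = 0.5f * (r + v / r);
    return r;
}

/* c = sqrt(x*x + y*y) in a single pass over the data. */
void hypot_fused(float *c, const float *x, const float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float a = x[i];            loads++;
        float b = y[i];            loads++;
        float s = a * a + b * b;   /* square, square, add: no memory traffic */
        c[i] = sqrt_newton(s);     stores++;
    }
}
```

The same four operations are performed, but the number of memory accesses per element drops to three.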

Slide9

What you WANT!

c = sqrt(add(square(x), square(y)));

Kernel   Operation   Memory Access         Cycles
square   1           load and store        804
square   1           load and store        804
add      1           2 loads and 1 store   804
sqrt     1           load and store        804
total    4           9                     3216

Kernel   Operation   Memory Access   Cycles
fu       1           load            404
         1           load            404
         1           -               4
         1           store           404
total    4           3               1216

Slide10

Two Choices

Modular implementation
  Reusable
  Easy to maintain and develop
  Individual kernel optimization
Monolithic implementation
  Performance
  Allows optimizations across what would otherwise be module boundaries

Can we do both?

Slide11

Introducing KFusion

Application File   Library File     Kernel File
square(…)          float* square    kernel square
square(…)
add(…)             float* add …     kernel add …
sqrt(…)            float* sqrt      kernel sqrt

Kernel   Operation   Memory Access         Cycles
square   1           load and store        804
square   1           load and store        804
add      1           2 loads and 1 store   804
sqrt     1           load and store        804
total    4           9                     3216

Slide12

After KFusion…

Application File   Library File     Kernel File
square(…)          void square …    kernel square
square(…)
add(…)             void add …       kernel add …
sqrt(…)            void sqrt        kernel sqrt

New call:          New function:    New kernel:
c = fu(…);         float* fu(…)     kernel fu(…)

Kernel   Operation   Memory Access   Cycles
fu       1           load            404
         1           load            404
         1           -               4
         1           store           404
total    4           3               1216

Slide13

It works!

Slide14

Road Map

KFusion at work
  what and how
  …why!
Costs and benefits
  annotations, lines of code
  modularity, performance
Future work and conclusion
  explicit composition of computation around data flow

Slide15

Costs

Annotations
  application hints
  library synchronization
  kernel data flow for compositions
Preprocessor
  build dependency graph
  source-to-source transformation
    loop fusion
    deforestation
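The "build dependency graph" step could be sketched roughly as follows. This is a minimal illustration in plain C, not KFusion's actual preprocessor: the Call struct, the build_deps function, and the rule that a call depends on the most recent earlier writer of each buffer it reads are assumptions made for the example.

```c
#include <string.h>

#define MAX_CALLS 8

/* One library call inside a fuse region: the buffer it writes
   and the buffers it reads (unused input slots are NULL). */
typedef struct {
    const char *name;
    const char *out;
    const char *in[2];
} Call;

/* Sets deps[j][i] to 1 when call j reads a buffer whose most recent
   writer among the earlier calls is call i, and to 0 otherwise.
   Assumes n <= MAX_CALLS. */
void build_deps(const Call *calls, int n, int deps[MAX_CALLS][MAX_CALLS]) {
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) deps[j][i] = 0;
        for (int k = 0; k < 2; k++) {
            const char *buf = calls[j].in[k];
            if (buf == NULL) continue;
            /* Scan backwards for the latest earlier writer of this buffer. */
            for (int i = j - 1; i >= 0; i--) {
                if (strcmp(calls[i].out, buf) == 0) {
                    deps[j][i] = 1;
                    break;
                }
            }
        }
    }
}
```

For the sequence square(x, x), square(y, y), add(c, x, y), sqrt(c, c), this marks add as depending on both square calls and sqrt as depending on add. A fusion pass could then look for such producer-consumer chains.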


Slide16

Annotations

Application:

#pragma start fuse
square(x, x);
square(y, y);
add(c, x, y);
sqrt(c, c);
c = sqrt(add(square(x), square(y)));
#pragma end fuse

Library:

#pragma sync out
public void dot_product(double result, vector x);

#pragma sync in
public void matrix_vector_mult(vector b, Matrix A, vector x);

Slide17

Annotations

Kernel (add):

__kernel void add_vectors(__global float* sum, __global float* v1, __global float* v2) {
#pragma kload
    {
        int i = get_global_id(0);
        float arg1 = v1[i];
        float arg2 = v2[i];
        float s;
    }
    s = arg1 + arg2;
#pragma kstore
    {
        sum[i] = s;
    }
}
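One way to read the kload / compute / kstore split is that it tells a preprocessor which lines touch memory and which only work on local values, so the compute regions of several kernels can be chained between a single load step and a single store step. The plain C sketch below illustrates that idea for add(c, square(x), square(y)). The function names and the loop standing in for the work-item index are hypothetical, and the code KFusion actually generates may differ.

```c
#include <stddef.h>

/* Compute regions of two kernels, with their loads and stores stripped away. */
static inline float square_compute(float arg) { return arg * arg; }
static inline float add_compute(float arg1, float arg2) { return arg1 + arg2; }

/* Hypothetical fused body: the load step of the first kernels, the chained
   compute regions, then the store step of the last kernel. */
void fused_sum_of_squares(float *sum, const float *v1, const float *v2, size_t n) {
    for (size_t i = 0; i < n; i++) {   /* i plays the role of get_global_id(0) */
        /* kload: the only reads from memory */
        float arg1 = v1[i];
        float arg2 = v2[i];

        /* compute regions chained through local variables */
        float s = add_compute(square_compute(arg1), square_compute(arg2));

        /* kstore: the only write to memory */
        sum[i] = s;
    }
}
```

Because the intermediate squares never leave local variables, the stores and reloads that separate kernels would need between steps disappear.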

Slide18

Dependency Graph

[Figure: dependency graph with nodes square(x), square(y), add(c,x,y) and sqrt(c), and edges labelled x, y and c]

Slide19

Transformation…

[Figure: transformed graph with nodes square(x), square(y) and add_sqrt(c,x,y), and edges labelled x, y and c]

Slide20

Replacement Kernel!

[Figure: single node fu(c,x,y), with edges labelled x, y and c]

Slide21

Annotations

AOSD 2013 Kiemele

Slide22

Benefits

Slide23

Performance

Slide24

Performance

Slide25

Roofline Analysis of Performance

Peak Actual GFlops = minimum(Bandwidth × flops/byte, Peak Performance)

Three linear algebra scenarios
  $c = \sqrt{a^2 + b^2}$
  $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
  Start of conjugate gradient
    $r = Ax - b$
    $p = r$
    $R^2 = r \cdot r$
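The roofline bound above can be written as a small function. In this sketch (plain C; the function names and the assumption of 4-byte floats are illustrative choices, not taken from the talk), intensity converts operations and memory accesses per element into flops per byte.

```c
/* Attainable GFLOP/s under the roofline model:
   min(bandwidth * flops per byte, peak performance). */
double roofline_gflops(double bandwidth_gb_s, double flops_per_byte,
                       double peak_gflops) {
    double memory_bound = bandwidth_gb_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Flops per byte for a kernel chain that performs `flops` operations and
   moves `accesses` single-precision (4-byte) values per element. */
double intensity(double flops, double accesses) {
    return flops / (accesses * 4.0);
}
```

With the per-element counts from the earlier tables (4 operations; 9 accesses unfused, 3 fused), the intensity rises from 4/36 to 4/12 flops per byte, so a bandwidth-bound kernel chain could attain up to three times the throughput under this model.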

Slide26

$c = \sqrt{a^2 + b^2}$

Slide27

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Slide28

Conjugate Gradient

Slide29

Road Map

KFusion at work
  what and how
  …why!
Costs and benefits
  annotations, lines of code
  modularity, performance
Future work and conclusion
  explicit composition of computation around data flow

Slide30

Future Work

Tools
  comprehension and visualization
  emulation
  performance testing
Combine with other approaches
  Optimizing compilers
  Code generators

kfuse {
    calls
}

__kernel void k(…) {
    kload { … }
    computation
    kstore { }
}

Slide31

Conclusion

KFusion is a first step towards
  explicit, flexible control
  allowing optimizations between modules
  separation of concerns

github.com/4Liamk/KFusion/wiki
