Dissertation Defense - PowerPoint Presentation
Uploaded by tawny-fly on 2016-03-12

Presentation Transcript

Dissertation Defense

Robert Senser
October 29, 2014

GPU DECLARATIVE FRAMEWORK: DEFG

PhD Committee:
  Gita Alaghband (chair)
  Tom Altman (advisor)
  Michael Mannino
  Boris Stilman
  Tam Vu

Presentation Outline

Motivation for Work
Background: Graphical Processing Units (GPUs) and OpenCL
GPU DECLARATIVE FRAMEWORK: DEFG
Diverse GPU Applications using DEFG
  Image Filters (Sobel and Median)
  Breadth-First Search
  Sorting Roughly Sorted Data
  Iterative Matrix Inversion
Dissertation Accomplishments
Future Research and Observations

Motivation for Work

GPUs can provide high throughput
  Radeon HD 7990: 2 (double-precision) TFLOPS
Developing parallel HPC software is difficult
Parallel development for GPUs is even more difficult
GPU HPC software development requires:
  Understanding of unique GPU hardware characteristics
  Use of specialized algorithms
  Use of GPU-specific, low-level APIs
    OpenCL
    CUDA
Driving notion: Let software minimize the complexity and difficulty.

Background: GPUs and OpenCL

Graphical Processing Unit (GPU)
  Highly specialized coprocessor
  Hundreds of cores
  Thousands of hardware-managed threads
  SIMT: Single Instruction, Multiple Thread
    Variant of the common Single Instruction, Multiple Data (SIMD) model
    Threads not on the execution path pause
  Code executed in a "kernel"
Common GPU programming environments
  OpenCL, which is Open Source
  CUDA, which is NVIDIA proprietary
DEFG is designed for OpenCL

High-Level GPU Architecture

[Figure: CPU with virtual memory and cache, GPU with global RAM and possible cache, connected by the PCIe bus]

GPU Characteristics:
  Processors often connected by Peripheral Component Interconnect Express (PCIe) bus
  GPU has own fast Global RAM
  Threads have a small amount of fast local memory
  May or may not have a cache
  Many hardware-controlled threads
  Lacks CPU-style predictive branching, etc.

OpenCL Overview

Specification provided by Khronos Group
  Open Source, multi-vendor
Hardware device support
  GPUs
  CPUs
  Digital signal processors (DSPs)
  Field-programmable gate arrays (FPGAs)
Device kernel normally written in C
  Each thread shares a common kernel
CPU-side code in C/C++
  Very low-level, detailed CPU-side application programming interface (API)
  Third-party bindings for Java, Python, etc.

GPU Applications

Three components
  Application algorithms
  CPU-side code
    Moves kernel code to GPU
    Manages GPU execution and errors
    Moves application data between CPU and GPU
    May contain a portion of application algorithms
  GPU kernel code
    Can have multiple kernels per application
    Each kernel usually contains an algorithm or algorithm step
    Kernel code often uses GPU-specific techniques
This work concentrates on the CPU-side code

GPU Performance

Major Issues in GPU Performance
  Kernel Instruction Path Divergence
    Occurs with conditional instructions (ifs, loops, etc.)
    Causes some threads to pause
    Needs to be minimized, if not totally avoided
  High Memory Latency
    Each RAM access can consume the time of 200-500 instructions
    Accesses to global RAM should be coalesced
    "Bank conflicts" can occur with local thread memory
Rob Farber's GPU suggestions [1]:
  "Get the data on the GPU and leave it"
  "Give the GPU enough work to do"
  "Focus on data reuse to avoid memory limits"
Existing HPC code usually re-factored for GPU use

DEFG Overview

GPU software development tool for OpenCL
Generates the CPU side of GPU applications
Uses a Domain-Specific Language (DSL)
  Developer writes CPU code in DEFG's DSL
  DEFG generates the corresponding CPU C/C++ program
Relative to hand-written CPU code:
  Faster, by using a declarative approach
  Simpler, by using design patterns and abstraction
Developer provides standard OpenCL GPU kernels

DEFG Translator Architecture [2,3]

DEFG generates the C/C++ code for the CPU.

Translator:
  DEFG Source Input
  ANTLR-based Parser
  XML-based Tree
  Optimizer (Java)
  Code Generator (C++)
  Template-Driven C/C++ Output

DEFG Benefits and Features

Implement OpenCL applications with less effort
  Requires many fewer lines of code to be written
  Encourages the developer to focus on the kernels
How is this done?
  With a Domain-Specific Language (DSL)
  Data characteristics are declared
  Uses one or more pre-defined DEFG design patterns
  Many details managed inside DEFG
Technical Features
  Abstracts the OpenCL APIs and their details
  Automatic optimization of buffer transfers
  Handles error detection
  Supports multiple GPU cards
  Supports anytime algorithms

DEFG Code Sample

01. declare application sobel
02. declare integer Xdim (0)
03. declare integer Ydim (0)
04. declare integer BUF_SIZE (0)
05. declare gpu gpuone ( any )
06. declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
07. declare integer buffer image1 ( BUF_SIZE )
08.         integer buffer image2 ( BUF_SIZE )
09. call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
10. execute run1 sobel_filter ( image1(in) image2(out) )
11. call disp_output (image2(in) Xdim(in) Ydim(in) )
12. end

Generated C/C++ (excerpt):

...
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { handle error }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL, global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { handle error }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0, BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
...

DEFG Design Patterns

Invocation Patterns (Control Flow)
  Sequential Flow
  Single-Kernel Repeat Sequence
  Multiple-Kernel
Concurrent-GPU Patterns (Multiple-GPU Support)
  Multiple-Execution
  Divide-Process-Merge
  Overlapped-Split-Process-Concatenate
Prefix-Allocation (Buffer Allocation, without Locking)
Dynamic-Swap (Buffer Swapping)
Code-Morsel (C/C++ Code Insertion)
Anytime Algorithm (Control Flow Change on External Event)
Design patterns can be combined
  Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel

DEFG Implementation

Lines of Code
  ANTLR-based parser: 580 lines
  Optimizer: 659 lines of Java
  Code Generator: 1,513 lines of C++
  Templates and includes: 1,572 lines of C++
Number of Diagnostic Programs: 15+
Testing investment: man-months
  Faults tended to be in the C/C++ code generation
  Most faults were in multi-GPU buffer management

Diverse New DEFG Applications

Constructed four very diverse GPU applications
  Image Processing: Sobel and Median Image Filters
    Showcase for multiple-GPU support
  Graph Theoretic: Breadth-First Search (BFS), Large Graphs
    Novel use of prefix sum to avoid GPU low-level locking
    BFS processing with multiple GPUs
  Sorting: Sorting Roughly Sorted Data
    Implementation of novel sorting approach
    Use of parallel prefix calculations in sorting optimization
    Also shows multiple-GPU support
  Numerical: Iterative Matrix Inversion, M. Altman's Method
    Demonstrates anytime algorithms
    Use of OpenCL clMath BLAS (Basic Linear Algebra Subprograms)
    When a measure is met, the anytime algorithm stops the process
These four applications demonstrate DEFG's applicability

Filter Application: Sobel Image Filter

Sobel operator detects edges in images
Pixel gradient calculated from 3x3 mask
Uses a single GPU kernel, invoked once
A base-line test application for multiple GPUs
Example of DEFG Sobel operator processing: [before/after images shown on slide]
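The 3x3 gradient computation described above can be sketched sequentially; a minimal C++ illustration of one pixel's Sobel response, using the standard 3x3 masks (this is not the dissertation's kernel code):

```cpp
#include <array>
#include <cmath>

// Sobel 3x3 masks for horizontal (Gx) and vertical (Gy) gradients.
using Mask = std::array<std::array<int, 3>, 3>;
const Mask GX = {{{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}};
const Mask GY = {{{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}}};

// Apply one 3x3 mask to a 3x3 pixel neighborhood.
int apply_mask(const Mask& m, const Mask& pixels) {
    int sum = 0;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            sum += m[r][c] * pixels[r][c];
    return sum;
}

// Gradient magnitude at the neighborhood's center pixel.
double sobel_magnitude(const Mask& pixels) {
    int gx = apply_mask(GX, pixels);
    int gy = apply_mask(GY, pixels);
    return std::sqrt(double(gx * gx + gy * gy));
}
```

On a neighborhood containing a vertical edge (columns 0, 5, 10), Gx is large and Gy is zero, so the vertical edge is detected; a flat neighborhood yields magnitude zero.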

Filter Application: Median Filter

Median filter removes "noise" from images
Median determined for 3x3 or 5x5 mask
Also uses a single GPU kernel, invoked once
2nd base-line test application for multiple GPUs
Example of DEFG median 5x5 filter processing: [before/after images shown on slide]
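The median selection above is easy to sketch sequentially; a minimal C++ illustration for a 3x3 mask (illustrative only, not the dissertation's kernel):

```cpp
#include <algorithm>
#include <array>

// Median of a 3x3 pixel neighborhood: partially sort the 9 values
// so the middle one lands at index 4. Replacing the center pixel
// with this median removes isolated "noise" pixels.
int median3x3(std::array<int, 9> window) {
    std::nth_element(window.begin(), window.begin() + 4, window.end());
    return window[4];
}
```

For example, a flat neighborhood of 10s with a single 255 noise spike yields a median of 10, so the spike is removed.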

Application: Breadth First Search (BFS)

Well-studied graph-theoretic problem
Focus: BFS with Very Large Irregular (VLI) Graphs
  Social Networking, Routing, Citations, etc.
Many published GPU BFS approaches, starting with Harish [4]
Harish used "Dijkstra" BFS
  Vertex frontier as a Boolean array
    1 = vertex on frontier
  A GPU thread assigned to each vertex
  Can result in poor thread utilization
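The Boolean-frontier scheme above can be sketched sequentially in C++, with one loop iteration standing in for each GPU thread (an illustrative sketch, not the Harish kernels):

```cpp
#include <algorithm>
#include <vector>

// Level-synchronous BFS in the style of a Boolean vertex frontier:
// each pass examines every vertex (as each GPU thread would), but
// only frontier vertices do useful work -- the source of the poor
// thread utilization noted above.
std::vector<int> frontier_bfs(const std::vector<std::vector<int>>& adj,
                              int source) {
    int n = (int)adj.size();
    std::vector<int> cost(n, -1);               // -1 = not yet visited
    std::vector<char> frontier(n, 0), next(n, 0);
    cost[source] = 0;
    frontier[source] = 1;
    bool active = true;
    while (active) {
        active = false;
        std::fill(next.begin(), next.end(), 0);
        for (int v = 0; v < n; ++v) {           // one "thread" per vertex
            if (!frontier[v]) continue;         // off-frontier threads idle
            for (int w : adj[v]) {
                if (cost[w] == -1) {
                    cost[w] = cost[v] + 1;
                    next[w] = 1;
                    active = true;
                }
            }
        }
        frontier.swap(next);
    }
    return cost;
}
```

On a path graph 0-1-2-3 starting from vertex 0, the costs come out 0, 1, 2, 3; on a sparse frontier, most per-vertex iterations do nothing, which is the utilization problem the slide describes.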

BFS Vertex Frontier

Merrill approach to vertex buffer management [5]
  Have a buffer with multiple update threads
  Uses prefix sum to allocate cells
Generalize this buffer management in DEF-G
  Provided as a set of kernel functions
  Useful for shared buffers with multiple GPU cards

Application: Sorting Roughly Sorted Data

Goal: Improve on O(n log n) sorting bound when sequence is partially sorted
Based on the prior sorting work by T. Altman, et al. [6]
k is a measure of "sortedness"
A sequence is k-sorted if no element is more than k positions out of sequence
Knowing k allows for sorts of O(n log k)
If k is small then we obtain a substantial performance gain
The k-sorted trait can be GPU-exploited
  Prefix sum in calculating k
  Parallel sorts of sub-sequences
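As a simplified sequential check of this k measure (assuming distinct keys; the dissertation computes a bound with parallel prefix scans instead), k can be taken as the largest displacement of any element from its sorted position:

```cpp
#include <algorithm>
#include <cstdlib>
#include <numeric>
#include <vector>

// Measure of sortedness: the largest displacement between an
// element's current position and its position in sorted order.
// The sequence is k-sorted for any k >= this value.
int sortedness_k(const std::vector<int>& a) {
    std::vector<int> idx(a.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Sort positions by value; idx[p] is the original position of
    // the element that belongs at sorted position p.
    std::stable_sort(idx.begin(), idx.end(),
                     [&](int i, int j) { return a[i] < a[j]; });
    int k = 0;
    for (int p = 0; p < (int)idx.size(); ++p)
        k = std::max(k, std::abs(idx[p] - p));
    return k;
}
```

The example sequence used later (5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16) measures k = 4, and an already-sorted sequence measures k = 0.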

Parallel Roughly Sorting Algorithm

Step LR: Left-to-right scan, compute running max
Step RL: Right-to-left scan, compute running min
Step DM: Uses LR-max array and RL-min array as inputs; computes each element's distance measure
Step UB: Finds distance measure upper bound
Step Sorting: Using distance measure
  Perform sort pass one
  Perform sort pass two
Notion: Convert the large sort problem into many smaller, parallel sort operations.
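The sorting step can be sketched sequentially under the assumption that the Step UB bound k is known: sort disjoint 2k-element blocks, then sort the same-size blocks shifted by k. For a k-sorted input the two passes fully sort the data, and every block sort is independent, which is what makes the parallel GPU version possible. A minimal C++ sketch:

```cpp
#include <algorithm>
#include <vector>

// Two-pass block sort for a k-sorted sequence.
// Pass one sorts blocks [0,2k), [2k,4k), ...; pass two sorts the
// overlapping blocks [k,3k), [3k,5k), ....  Because no element is
// more than k positions out of place, two passes suffice.
void rough_sort(std::vector<int>& a, int k) {
    int n = (int)a.size();
    if (k <= 0) return;                              // already sorted
    for (int start = 0; start < n; start += 2 * k)   // pass one
        std::sort(a.begin() + start,
                  a.begin() + std::min(start + 2 * k, n));
    for (int start = k; start < n; start += 2 * k)   // pass two
        std::sort(a.begin() + start,
                  a.begin() + std::min(start + 2 * k, n));
}
```

Each `std::sort` call stands in for one independent sub-sequence sort, which the GPU version runs in parallel.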

Iterative Matrix Inversion (IMI)

DEFG iterative matrix inversion application using M. Altman's method [7]
Use the Anytime-algorithm approach to manage the iterative inversion
  Inversion is stoppable at "anytime"
  Can balance run time against accuracy
  Anytime management in DEFG, not the application
Requires GPU matrix operations
  Use OpenCL clMath (APPML)
  clMath integration into DEFG

M. Altman IMI Approach

The initial inverse approximation, that is R0, can be formed by:

  R0 = αI, where α = 1 / ||A||,

||A|| being the Euclidean norm of A and I the identity matrix.

To invert matrix A, each iteration calculates:

  Rn+1 = Rn (3I - 3ARn + (ARn)^2)

with the result in Rn+1.

A better R0 estimate provides for quicker convergence
Method is self-correcting
DEFG Anytime facility stops the iterations
  When inversion quality measure is met
  When max iterations have occurred
  When max run time has occurred
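The iteration and its stopping rules can be sketched for small dense matrices in plain C++ (a minimal sequential illustration; the IMIFLX application performs these products with clMath BLAS on the GPU, and this sketch implements only the quality-measure and max-iteration stops):

```cpp
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// C = A * B for square matrices.
Mat mul(const Mat& A, const Mat& B) {
    size_t n = A.size();
    Mat C(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k)
            for (size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Frobenius (Euclidean) norm of A*Rn - I, the inversion quality measure.
double residual_norm(const Mat& ARn) {
    double s = 0.0;
    for (size_t i = 0; i < ARn.size(); ++i)
        for (size_t j = 0; j < ARn.size(); ++j) {
            double d = ARn[i][j] - (i == j ? 1.0 : 0.0);
            s += d * d;
        }
    return std::sqrt(s);
}

// M. Altman's cubically convergent inversion:
//   R0 = alpha*I with alpha = 1/||A||;
//   R(n+1) = Rn * (3I - 3*A*Rn + (A*Rn)^2).
// Anytime-style stop: quality measure met, or max iterations reached.
Mat altman_invert(const Mat& A, double tol, int max_iter) {
    size_t n = A.size();
    double normA = 0.0;
    for (const auto& row : A)
        for (double v : row) normA += v * v;
    normA = std::sqrt(normA);
    Mat R(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) R[i][i] = 1.0 / normA;  // R0 = alpha*I
    for (int it = 0; it < max_iter; ++it) {
        Mat AR = mul(A, R);
        if (residual_norm(AR) < tol) break;                // quality met
        Mat T = mul(AR, AR);                               // (A*Rn)^2
        for (size_t i = 0; i < n; ++i)                     // + 3I - 3*A*Rn
            for (size_t j = 0; j < n; ++j)
                T[i][j] += -3.0 * AR[i][j] + (i == j ? 3.0 : 0.0);
        R = mul(R, T);
    }
    return R;
}
```

For a diagonal matrix diag(4, 2) the iterates converge in a handful of steps to diag(0.25, 0.5), illustrating the cubic convergence claimed for the method.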

Accomplishments: DEFG Framework

Fully Implemented
  Consists of approximately 5,000 code lines
  7 different applications and 15+ diagnostic programs
  Complete User's Guide
  Packaged for general use
Design Patterns
  10+ patterns
  Patterns range from simple to complex
Delineation of DEF-G Limits
Explanation for success of DSL and design patterns

DEFG’s Performance

DEFG Papers
  Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'13) [2]
  Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'14) [3]
Analysis
  Three existing OpenCL applications converted to DEFG
    CPU side re-coded in DEFG, reusing the existing GPU kernels
  Three applications:
    Breadth-First Search (BFS)
    All-Pairs Shortest Path (APSP/FW)
    Sobel Image Filter (SOBEL)
  Output results carefully verified
Comparisons between "reference" and DEFG
  Lines-of-Code Comparison
  Run-time Performance Comparison

Lines-of-Code Comparison

          DEFG      DEFG
          Source    Gen.     Ref.
SOBEL       12       467      442
BFS         42       620      364
FW          12       481      478

On average, the DEFG code is 5.6 percent of the reference code.

Run-Time Performance Comparison

Shown are original run times (average of 10 runs)
Later, made manual changes to understand timing differences
FW reference version was slow due to a vendor coding error
Likely the CPU-based BFS-4096 was fast due to the CPU's cache
Summary: DEFG provided equal, or better, performance

New Applications

Application Implementations
  Filtering, BFS, Roughly Sorting, Iterative Inversion
Implementation Goals
  Show general applicability of DEFG
  Multiple-GPU: Filtering, BFS, and R. Sorting
  Novel Algorithm: R. Sorting and Iterative Inversion
  Proof of Concept: Iterative Inversion (BLAS usage)
Application Performance Results
  Run-time performance
  Single-GPU and multiple-GPU configurations
  Problem-size characteristics
Vast majority of tests run on the Hydra server

Image Filtering Results

Image Applications Implementation
  Both Sobel operator and median filter:
    Overlapped-Split-Process-Concatenate design pattern
    Single and multiple-GPU versions
Analysis with large images and multiple GPUs
Image neighborhoods
  Sobel operator: 3x3 grid
  Median filter:
    3x3 grid: less computationally intense
    5x5 grid: more computationally intense

Sobel Operator Application Results

Single-GPU refactored for multiple-GPU use
  Used existing OpenCL kernel
  Three simple DEFG code changes needed
New version used two GPUs
  50% of the image, plus a small overlap, given to each
  Produced same resultant image
Run-time performance was not impressive
  Not sufficiently computationally intense
  OpenCL transfer times went up
  Kernel execution times stayed the same

Median Filter Application Results

CPU-side DEFG code very similar to Sobel
Developed two new OpenCL kernels
  3x3 grid kernel
  5x5 grid kernel
Performance with 3x3 grid similar to Sobel
Performance with multiple GPUs, 5x5 median
  Run-time improvement with all test images
  With large image: 1.062 seconds (1 GPU) down to 0.794 (2 GPUs)
  Speedup: 1.34 with 2 GPUs, with 7k x 7k image
  Also, 2-GPU median filter handled larger images (22k x 22k)
Performance Analysis with 2 GPUs
  Kernel execution run times dropped
  CPU-to-GPU OpenCL transfer times increased
  pthreads experiment showed need for O.S. threads

Breadth-First Search Results

Breadth-First Search (BFS) Summary
  DEFG generalization of the Merrill approach
  Prefix-scan based buffer allocation
  "Virtual pointers" to nodes between GPUs
  Used Harish sparse data structure approach
Analysis of BFS application
  Characteristics
  Capabilities
  Run-time performance

Multiple-GPU BFS Implementation

BFSDP2GPU DEFG Application
  Re-factoring of previously-ported DEFG BFS application
  DEFG implementation of a complex OpenCL application
  Management of shared buffers with prefix sum
  Run-time communications between GPUs
Tested application against VLI graphs
  Test graphs from SNAP and DIMACS repositories
    Stanford Network Analysis Package (SNAP) [8]
    Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [9]
  Very large graph datasets: millions of vertices and edges

BFSDP2GPU Results

Analysis of BFSDP2GPU application
  Characteristics
    Kernel count went from 2 kernels to 6
    Used 2 GPUs
  Capabilities and run-time performance
    Single-card versus multi-card performance
    Performance relative to existing BFS application
Application results with VLI graphs
  Processed large graphs (4.8M nodes, 69M edges)
  However, unimpressive run-time performance
    Run times increased by factors of 6 to 17
  Issue: OpenCL's lack of GPU-to-GPU communications
  Lesser issue: mixing of sparse and dense data structures

Roughly Sorting

RSORT application implementation
  Divide-Process-Merge pattern utilized
  Implementation contains five kernels:
    (LRmax, RLmin, DM, UB, and comb_sort)
Sort selected for OpenCL kernel: comb sort
  sort-in-place design
  non-recursive
  similar to bubble sort but much faster
  elements are compared a gap apart
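A minimal sequential C++ sketch of comb sort as characterized above (the common shrink factor of about 1.3 is assumed; the dissertation's OpenCL kernel is not shown):

```cpp
#include <utility>
#include <vector>

// Comb sort: like bubble sort, but compares elements `gap` apart,
// shrinking the gap by roughly 1.3 each pass until the array is
// sorted. In-place and non-recursive, which suits a GPU kernel.
void comb_sort(std::vector<int>& a) {
    int n = (int)a.size();
    int gap = n;
    bool swapped = true;
    while (gap > 1 || swapped) {
        gap = (gap * 10) / 13;            // shrink factor ~1.3
        if (gap < 1) gap = 1;
        swapped = false;
        for (int i = 0; i + gap < n; ++i) {
            if (a[i] > a[i + gap]) {
                std::swap(a[i], a[i + gap]);
                swapped = true;
            }
        }
    }
}
```

The large initial gaps move far-out-of-place elements quickly, which is why comb sort is much faster than bubble sort in practice.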

RSORT Results

Run-Time Comparisons using large datasets
  Generated with set k value and dataset size
  Fully perturbed dataset, or singly perturbed
Example with a k value of 4, 16 perturbed items:
  5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16
Performance analysis over 3 configurations
  QSORT on CPU, used as baseline
  RSORT with 1 GPU
  RSORT with 2 GPUs

RSORT Results Summary

Application implemented in 1-GPU and 2-GPU forms
RSORT run times generally faster when k is small
  At k = 1000, 2-GPU to 1-GPU speedup is 1.73
2-GPU RSORT handles larger datasets than 1-GPU

Iterative Matrix Inversion

IMIFLX application implementation
  Used DEFG blas statement to access clMath functions
  Multiple-Kernel-Loop pattern used
  Multiple blas statements per iteration
  Blend of blas statements and kernels
  Anytime used to end iterations at a time limit
Analysis of application
  Inversion accuracy
  Range of matrices: size and type
  Used data from the University of Florida Sparse Matrix Collection [10]

Application Sample Result

IMIFLX uses 13 iterations for this 500 x 500 matrix
Norm value: ||(A*Rn) - I||
Graph shows convergence to a solution
Run time was 0.259 seconds

Iterative Matrix Inversion Results

Required BLAS support in DEFG
Anytime support: traded accuracy for less run time
Hydra's NVIDIA Tesla T20 GPU
  Available RAM: 2,678 MB
  Limits double-precision matrix to just over 8,000 by 8,000

Name       Type        Size        Iterations  Seconds
H2         Hilbert     2x2         4           0.018
H12        Hilbert     12x12       70          0.089
M500       Generated   500x500     13          0.259
M500any    Generated   500x500     10          0.206
M8000      Generated   8000x8000   17          1380.320
M8500      Generated   8500x8500   n.a.        overflow
1138_bus   Repository  1138x1138   14          3.262
Kuu        Repository  7102x7102   9           605.310

Dissertation Accomplishments

Designed, Implemented, and Tested DEFG
Produced DEFG "Enabling" Design Patterns
Compared DEFG Applications to Hand-Written
  DEFG applications required less code
  DEFG applications produced equal run times
Applied DEFG to Diverse GPU Applications
  Four diverse applications fully implemented
  Impressive application run-time results
    with the exception of BFS, due to an OpenCL limit

Future Research

Additional DEFG Design Patterns
  Multiple-GPU Load Balancing
  Resource sharing
Suggest DEFG Support for NVIDIA's CUDA
Suggest a re-factored DEFG
  Internal DSL
  More-standard programming environment
  Enable support for more environments
Not optimistic about a declarative approach for the GPU side
Potential for other technical improvements

DEFG’s Success

DEFG is a DSL focused on HPC with GPUs
Note the Farber suggestions for GPU performance:
  "Get the data on the GPU and leave it"
  "Give the GPU enough work to do"
  "Focus on data reuse to avoid memory limits"
The CPU becomes the orchestrator
DEFG provides the CPU code to orchestrate
  Declarations to describe the data
  Design patterns to describe the orchestration
  Optimization to minimize data transfers

References

[1] Farber, R. CUDA Application Design and Development. Elsevier, 2011.

[2] Senser, R. and Altman, T. "DEF-G: Declarative Framework for GPUs." Proceedings of The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (2013): 490-496.

[3] Senser, R. and Altman, T. "A second generation of DEFG: Declarative Framework for GPUs." Proceedings of The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (to be published November 2014).

[4] Harish, P. and Narayanan, P. "Accelerating large graph algorithms on the GPU using CUDA." High Performance Computing - HiPC 2007. Springer Berlin Heidelberg, 2007. 197-208.

[5] Merrill, D. and Grimshaw, A. S. "Revisiting sorting for GPGPU stream architectures." Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.

[6] Altman, T. and Igarashi, Y. "Roughly sorting: Sequential and parallel approach." Journal of Information Processing 12.2 (1989): 154-158.

[7] Altman, M. "An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space." Pacific Journal of Mathematics 10.4 (1960): 297-300.

[8] SNAP URL: http://snap.stanford.edu/data

[9] DIMACS URL: http://www.dis.uniroma1.it/challenge9/download.shtml

[10] University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparce/matrices

Additional Slides

Raw Performance Numbers for Three Applications, in Milliseconds

                 CPU               GPU-Tesla T20
              DEF-G    Ref.       DEF-G    Ref.
BFS-4096        1.5     2.6         4.3     5.8
BFS-65536      12.3    14.2         8.0    11.3
FW            111.8   152.0         6.0    51.2
SOBEL          23.0    24.8         3.7     4.1

Sample DEFG Code Showing a Sequence

01. declare application floydwarshall
02. declare integer NODE_CNT (0)
03. declare integer BUF_SIZE (0)
04. declare gpu gpuone ( any )
05. declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
06. declare integer buffer buffer1 ( BUF_SIZE )
07.         integer buffer buffer2 ( BUF_SIZE )
08. call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
09. sequence NODE_CNT times
10. execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
11. call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
12. end
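For reference, the pass structure that the `sequence NODE_CNT times` pattern repeats is the classic Floyd-Warshall relaxation; a sequential C++ sketch (illustrative, standing in for the OpenCL kernel):

```cpp
#include <vector>

// Floyd-Warshall over an adjacency matrix dist (dist[i][j] = edge
// weight; a large value means no edge). Each outer iteration k is
// one "pass", matching the DEFG `sequence NODE_CNT times` pattern,
// and the inner i/j loops are what the 2D kernel parallelizes.
void floyd_warshall(std::vector<std::vector<int>>& dist) {
    int n = (int)dist.size();
    for (int k = 0; k < n; ++k)            // one kernel pass per k
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}
```

On a 3-node graph with edges 0->1 of weight 5 and 1->2 of weight 2, the pass over k = 1 discovers the 0->2 path of weight 7.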

Sample DEFG Code Showing a Loop-While

declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
        kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
        integer buffer graph_edges (EDGE_CNT)
        integer buffer graph_mask (NODE_CNT)
        integer buffer updating_graph_mask ( $NODE_CNT )
        integer buffer graph_visited (NODE_CNT)
        integer buffer cost (NODE_CNT)
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out) updating_graph_mask(out) graph_visited(out) cost(out) NODE_CNT(out) EDGE_CNT(out))
loop
execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in) updating_graph_mask(out) graph_visited(in) cost(inout) $NODE_CNT(in) )
// set STOP to zero each time thru...
set STOP (0)
// note: STOP value is returned...
execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout) graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end

RSORT Data

IMIFLX Data

DEFG 4-Way Mini-Experiment SpeedUp

Old Slides


DEFG Architecture (old)

DEF-G Input Code -> ANTLR-Based DEF-G Parser -> XML Document -> TinyXML2 & DEF-G Code Generator -> OpenCL Code

The DEFG framework generates the CPU code
  Input: declarative statements
  Uses design patterns
  Output: OpenCL code

DEFG "Translator"
  ANTLR-Based Parser
  Intermediate XML Document
  TinyXML2 Parser
  Code Generator written in C++

Accomplishments ???

DEFG
  Inputs declarations
  Uses declared design patterns
  Generates CPU-side OpenCL code
Summarized the Proof-of-Concept DEFG
Described the Version 2 DEFG enhancements
Described the diverse DEFG applications
  These show the DEFG applicability and flexibility
  Each is a full application implementation
Addressed DEFG research goals

Presentation Outline

Research Accomplishments
  Designed and implemented DEFG
  Produced DEFG "Enabling" Design Patterns
  Applied DEFG to Diverse GPU Applications
  Analyzed the Performance of these Applications
Future Research

Proposed Dissertation Work Plan

DEFG Enhancements (design patterns)
  Three existing design patterns
    Execute kernel once, execute N times, and loop-while
    The current loop-while syntax is too "procedural"
  New design patterns
    Anytime algorithm support
    Multiple-GPU support
      divide, process, merge
      overlapped split, process, concatenate
    Explicit Parallel
  Add other interesting DEF-G design patterns

DEFG Enhancements (implementation)
  ((get list from doc))

Image Filtering Results

Developed proof-of-concept DEFG version
Created a proof-of-concept, hand-written, multiple-GPU version
  Using two cards doubled throughput
  Complexity in managing overlapped sub-images
Next steps:
  DEFG multiple-GPU version
    Due to 3x3 mask, sub-images overlap
    Uses overlapped split, process, concatenate design pattern
  Testing and analysis with large images and more than one GPU
  Expect to find (or emulate) a four-GPU environment

Accomplishments

Declarative Approach to ...
Design Patterns
DEFG Tools
  Produced DEFG parser, optimizer, and code generator
  Diagnostics ...
DEFG Applications
  Performance verification
  Image Filters
  Multiple-GPU graph processing
  Rough Sorting
  Iterative Matrix Inversion

Proposed New DEFG Features

Additional DEFG Design Patterns
  Multiple-GPU Load Balancing
  Resource sharing
DEFG Support for NVIDIA's CUDA
Re-factored DEFG (DEFG Version 3)
  Internal DSL
  More-standard programming environment
  Could support more environments

Roughly Sorting Implementation

RSORT implementation contained five kernels
  Comb sort
Analysis and Complications
  Choose an existing OpenCL sort for use
  Add support for multiple GPU cards using DEFG's divide, process, merge pattern
  Performance analysis of GPU Roughly Sorting