Dissertation Defense
Robert Senser
October 29, 2014

GPU DECLARATIVE FRAMEWORK: DEFG

PhD Committee:
Gita Alaghband (chair)
Tom Altman (advisor)
Michael Mannino
Boris Stilman
Tam Vu
Presentation Outline

Motivation for Work
Background: Graphical Processing Units (GPUs) and OpenCL
GPU DECLARATIVE FRAMEWORK: DEFG
Diverse GPU Applications using DEFG
  Image Filters (Sobel and Median)
  Breadth-First Search
  Sorting Roughly Sorted Data
  Iterative Matrix Inversion
Dissertation Accomplishments
Future Research and Observations
Motivation for Work

GPUs can provide high throughput
  Radeon HD 7990: 2 (double-precision) TFLOPS
Developing parallel HPC software is difficult
Parallel development for GPUs is even more difficult
GPU HPC software development requires:
  Understanding of unique GPU hardware characteristics
  Use of specialized algorithms
  Use of GPU-specific, low-level APIs: OpenCL, CUDA
Driving notion: Let software minimize the complexity and difficulty.
Background: GPUs and OpenCL

Graphical Processing Unit (GPU)
  Highly specialized coprocessor
  Hundreds of cores
  Thousands of hardware-managed threads
SIMT: Single Instruction, Multiple Thread
  Variant of the common Single Instruction, Multiple Data (SIMD) model
  Threads not on the execution path pause
Code executed in a "kernel"
Common GPU programming environments
  OpenCL, which is Open Source
  CUDA, which is NVIDIA proprietary
DEFG is designed for OpenCL
High-Level GPU Architecture

[Diagram: CPU with RAM, virtual memory, and cache, connected over the PCIe bus to a GPU with global RAM, local memory, and an optional cache]

GPU Characteristics:
  Processors often connected by a Peripheral Component Interconnect Express (PCIe) bus
  GPU has its own fast global RAM
  Threads have a small amount of fast local memory
  May or may not have a cache
  Many hardware-controlled threads
  Lacks CPU-style predictive branching, etc.
OpenCL Overview

Specification provided by the Khronos Group
  Open Source, multi-vendor
Hardware device support
  GPUs
  CPUs
  Digital signal processors (DSPs)
  Field-programmable gate arrays (FPGAs)
Device kernel normally written in C
  Each thread shares a common kernel
CPU-side code in C/C++
  Very low-level, detailed CPU-side application programming interface (API)
  Third-party bindings for Java, Python, etc.
GPU Applications

Three components
  Application algorithms
  CPU-side code
    Moves kernel code to the GPU
    Manages GPU execution and errors
    Moves application data between CPU and GPU
    May contain a portion of the application algorithms
  GPU kernel code
    Can have multiple kernels per application
    Each kernel usually contains an algorithm or algorithm step
    Kernel code often uses GPU-specific techniques
This work concentrates on the CPU-side code
GPU Performance

Major issues in GPU performance
  Kernel instruction path divergence
    Occurs with conditional instructions (ifs, loops, etc.)
    Causes some threads to pause
    Needs to be minimized, if not totally avoided
  High memory latency
    Each RAM access can consume the time of 200-500 instructions
    Accesses to global RAM should be coalesced
    "Bank conflicts" can occur with local thread memory
Rob Farber's GPU suggestions [1]:
  "Get the data on the GPU and leave it"
  "Give the GPU enough work to do"
  "Focus on data reuse to avoid memory limits"
Existing HPC code usually re-factored for GPU use
DEFG Overview

GPU software development tool for OpenCL
Generates the CPU side of GPU applications
Uses a Domain-Specific Language (DSL)
  Developer writes CPU code in DEFG's DSL
  DEFG generates the corresponding CPU C/C++ program
Relative to hand-written CPU code:
  Faster, by using a declarative approach
  Simpler, by using design patterns and abstraction
Developer provides standard OpenCL GPU kernels
DEFG Translator Architecture [2,3]

The DEFG translator generates the C/C++ code for the CPU.

Translator pipeline:
  DEFG Source Input → ANTLR-based Parser → XML-based Tree → Optimizer (Java) → Code Generator (C++) → Template-Driven C/C++ Output
DEFG Benefits and Features

Implement OpenCL applications with less effort
  Requires many fewer lines of code to be written
  Encourages the developer to focus on the kernels
How is this done?
  With a Domain-Specific Language (DSL)
  Data characteristics are declared
  Uses one or more pre-defined DEFG design patterns
  Many details managed inside DEFG
Technical features
  Abstracts the OpenCL APIs and their details
  Automatic optimization of buffer transfers
  Handles error detection
  Supports multiple GPU cards
  Supports anytime algorithms
DEFG Code Sample

declare application sobel
declare integer Xdim (0)
declare integer Ydim (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
declare integer buffer image1 ( BUF_SIZE )
        integer buffer image2 ( BUF_SIZE )
call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
execute run1 sobel_filter ( image1(in) image2(out) )
call disp_output (image2(in) Xdim(in) Ydim(in) )
end

Corresponding generated C/C++ (excerpt):

…
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { handle error }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL,
    global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { handle error }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0,
    BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
…
DEFG Design Patterns

Invocation Patterns (Control Flow)
  Sequential-Flow
  Single-Kernel
  Repeat-Sequence
  Multiple-Kernel
Concurrent-GPU Patterns (Multiple-GPU Support)
  Multiple-Execution
  Divide-Process-Merge
  Overlapped-Split-Process-Concatenate
Prefix-Allocation (Buffer Allocation, without Locking)
Dynamic-Swap (Buffer Swapping)
Code-Morsel (C/C++ Code Insertion)
Anytime-Algorithm (Control Flow Change on External Event)
Design patterns can be combined
  Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel
DEFG Implementation

Lines of code
  ANTLR-based parser: 580 lines
  Optimizer: 659 lines of Java
  Code Generator: 1,513 lines of C++
  Templates and includes: 1,572 lines of C++
Number of diagnostic programs: 15+
Testing investment: man-months
  Faults tended to be in the C/C++ code generation
  Most faults were in multi-GPU buffer management
Diverse New DEFG Applications

Constructed four very diverse GPU applications
  Image Processing: Sobel and Median Image Filters
    Showcase for multiple-GPU support
  Graph Theoretic: Breadth-First Search (BFS), Large Graphs
    Novel use of prefix sum to avoid GPU low-level locking
    BFS processing with multiple GPUs
  Sorting: Sorting Roughly Sorted Data
    Implementation of a novel sorting approach
    Use of parallel prefix calculations in sorting optimization
    Also shows multiple-GPU support
  Numerical: Iterative Matrix Inversion, M. Altman's Method
    Demonstrates anytime algorithms
    Use of OpenCL clMath BLAS (Basic Linear Algebra Subprograms)
    When a measure is met, the anytime algorithm stops the process
These four applications demonstrate DEFG's applicability
Filter Application: Sobel Image Filter

Sobel operator detects edges in images
Pixel gradient calculated from a 3x3 mask
Uses a single GPU kernel, invoked once
A baseline test application for multiple GPUs
Example of DEFG Sobel operator processing: [image omitted]
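The per-pixel work the Sobel kernel performs can be illustrated with a sequential CPU sketch (plain Python, assuming the common 3x3 Sobel masks and a row-major grayscale image; names and structure are illustrative, not DEFG's actual kernel):

```python
# CPU sketch of the per-pixel Sobel gradient (illustrative; the real kernel runs in OpenCL C).
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal-gradient mask
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical-gradient mask

def sobel(image):
    """Return the gradient magnitude for each interior pixel of a 2-D grayscale image."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):            # on the GPU, one thread handles one (x, y)
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5   # gradient magnitude
    return out
```

A vertical intensity step (e.g. a column jump from 0 to 255) yields a large gx and near-zero gy, marking an edge; flat regions produce zero.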
Filter Application: Median Filter

Median filter removes "noise" from images
Median determined for a 3x3 or 5x5 mask
Also uses a single GPU kernel, invoked once
A second baseline test application for multiple GPUs
Example of DEFG median 5x5 filter processing: [image omitted]
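The median filter's per-pixel step can likewise be sketched sequentially (a minimal Python illustration, assuming interior pixels only; radius=1 gives the 3x3 mask, radius=2 the 5x5 mask):

```python
import statistics

def median_filter(image, radius=1):
    """Replace each interior pixel with the median of its (2*radius+1)^2 neighborhood."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]      # borders left unchanged in this sketch
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            neighborhood = [image[y + j][x + i]
                            for j in range(-radius, radius + 1)
                            for i in range(-radius, radius + 1)]
            out[y][x] = statistics.median(neighborhood)   # outlier "noise" is discarded
    return out
```

A single salt-noise spike (one 255 among 10s) is replaced by the neighborhood median, which is why the filter de-noises without blurring edges as much as averaging does.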
Application: Breadth-First Search (BFS)

Well-studied graph-theoretic problem
Focus: BFS with Very Large Irregular (VLI) Graphs
  Social Networking, Routing, Citations, etc.
Many published GPU BFS approaches, starting with Harish [4]
Harish used "Dijkstra" BFS
  Vertex frontier as a Boolean array (1 = vertex on the frontier)
  A GPU thread assigned to each vertex
  Can result in poor thread utilization
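The frontier-array scheme can be emulated sequentially (a Python sketch, assuming an adjacency-list graph; in the GPU version the inner loop over vertices becomes one thread per vertex):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS with a Boolean frontier array (Harish-style sketch)."""
    n = len(adj)
    cost = [-1] * n                  # -1 means "not yet visited"
    cost[source] = 0
    frontier = [False] * n
    frontier[source] = True          # 1 = vertex on the frontier
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):           # one GPU thread per vertex; most find frontier[v] False
            if frontier[v]:
                for w in adj[v]:
                    if cost[w] == -1:
                        cost[w] = cost[v] + 1
                        next_frontier[w] = True
        frontier = next_frontier
    return cost
```

When the frontier holds only a few vertices, most of the n threads merely test the flag and exit, which is the poor thread utilization noted above.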
BFS Vertex Frontier

Merrill approach to vertex buffer management [5]
  Have a buffer with multiple update threads
  Uses prefix sum to allocate cells
Generalize this buffer management in DEFG
  Provided as a set of kernel functions
  Useful for shared buffers with multiple GPU cards
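The prefix-sum allocation idea can be sketched as follows (Python, emulating the kernel logic sequentially; this is an illustration of the technique, not the dissertation's kernel code). Each "thread" counts its items, an exclusive prefix sum assigns each thread a disjoint slice of the shared buffer, then all threads write without locks:

```python
from itertools import accumulate

def allocate_and_scatter(per_thread_items):
    """Lock-free buffer packing: an exclusive prefix sum gives each thread its write offset."""
    counts = [len(items) for items in per_thread_items]
    offsets = [0] + list(accumulate(counts))[:-1]    # exclusive prefix sum of the counts
    out = [None] * sum(counts)
    for tid, items in enumerate(per_thread_items):   # each "thread" writes its own slice
        for k, item in enumerate(items):
            out[offsets[tid] + k] = item             # no two threads ever share a cell
    return out, offsets
```

On the GPU the prefix sum itself runs in parallel, so the whole allocation needs no atomic operations or low-level locking.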
Application: Sorting Roughly Sorted Data

Goal: Improve on the O(n log n) sorting bound when the sequence is partially sorted
Based on prior sorting work by T. Altman, et al. [6]
  k is a measure of "sortedness"
  A sequence is k-sorted if no element is more than k positions out of sequence
  Knowing k allows for sorts of O(n log k)
  If k is small, we obtain a substantial performance gain
The k-sorted trait can be exploited on the GPU
  Prefix sum in calculating k
  Parallel sorts of sub-sequences
Parallel Roughly Sorting Algorithm

Step LR: Left-to-right scan, compute running max
Step RL: Right-to-left scan, compute running min
Step DM: Uses the LR-max array and RL-min array as inputs; computes each element's distance measure
Step UB: Finds the distance measure upper bound
Step Sorting: Using the distance measure
  Perform sort pass one
  Perform sort pass two

Notion: Convert the large sort problem into many smaller, parallel sort operations.
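The steps above can be sketched sequentially in Python. This is one common formulation of the sortedness measure (distance computed from the LR-max and RL-min arrays), and the two offset block-sort passes are one way to realize the sorting steps; the dissertation's exact pass structure may differ:

```python
from itertools import accumulate

def sortedness_k(a):
    """Upper bound k on how far any element sits from its sorted position,
    computed from the left-to-right running max and right-to-left running min."""
    n = len(a)
    lr_max = list(accumulate(a, max))                  # Step LR
    rl_min = list(accumulate(reversed(a), min))[::-1]  # Step RL
    k = 0
    for i in range(n):                                 # Steps DM and UB combined
        d = 0                                          # distance measure for position i
        while i + d + 1 < n and rl_min[i + d + 1] < lr_max[i]:
            d += 1
        k = max(k, d)
    return k

def rough_sort(a):
    """Sort a k-sorted sequence with two passes of independent 2k-block sorts
    (the blocks can all be sorted in parallel on the GPU)."""
    m = max(1, sortedness_k(a))
    b = a[:]
    for start in range(0, len(b), 2 * m):              # pass one: aligned 2k blocks
        b[start:start + 2 * m] = sorted(b[start:start + 2 * m])
    for start in range(m, len(b), 2 * m):              # pass two: blocks offset by k
        b[start:start + 2 * m] = sorted(b[start:start + 2 * m])
    return b
```

Each block sort touches only 2k elements, so for small k the work per block, and hence the O(n log k) behavior, is far below a full O(n log n) sort.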
Iterative Matrix Inversion (IMI)

DEFG iterative matrix inversion application using M. Altman's method [7]
Uses the anytime-algorithm approach to manage the iterative inversion
  Inversion is stoppable at "anytime"
  Can balance run time against accuracy
  Anytime management in DEFG, not the application
Requires GPU matrix operations
  Uses OpenCL clMath (APPML)
  clMath integration into DEFG
M. Altman IMI Approach

The initial inverse approximation, that is R_0, can be formed by:
  R_0 = αI, where α = 1 / ||A||,
with ||A|| being the Euclidean norm of A and I the identity matrix.

To invert matrix A, each iteration calculates:
  R_{n+1} = R_n (3I - 3AR_n + (AR_n)^2)
with the result in R_{n+1}.

A better R_0 estimate provides for quicker convergence
The method is self-correcting
The DEFG Anytime facility stops the iterations:
  When the inversion quality measure is met
  When max iterations have occurred
  When max run time has occurred
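The iteration can be sketched with plain-Python matrix arithmetic (in DEFG this runs as clMath BLAS calls on the GPU). This sketch assumes a symmetric positive-definite A, for which the R_0 = αI start converges, and uses the Frobenius norm as the Euclidean norm; the quality measure ||A R_n - I|| plays the role of the anytime stopping test:

```python
def matmul(X, Y):
    """Naive dense matrix product (stands in for a BLAS GEMM call)."""
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def altman_inverse(A, tol=1e-10, max_iter=100):
    """Cubically convergent iteration R_{n+1} = R_n (3I - 3 A R_n + (A R_n)^2)."""
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    norm = sum(sum(x * x for x in row) for row in A) ** 0.5     # Frobenius norm of A
    R = [[I[i][j] / norm for j in range(n)] for i in range(n)]  # R_0 = I / ||A||
    for _ in range(max_iter):
        AR = matmul(A, R)
        AR2 = matmul(AR, AR)
        T = [[3 * I[i][j] - 3 * AR[i][j] + AR2[i][j] for j in range(n)]
             for i in range(n)]
        R = matmul(R, T)                                        # R_{n+1}
        AR_new = matmul(A, R)
        err = sum(sum((AR_new[i][j] - I[i][j]) ** 2 for j in range(n))
                  for i in range(n)) ** 0.5                     # quality: ||A R - I||
        if err < tol:                                           # anytime-style stop
            break
    return R
```

Because the residual satisfies I - A R_{n+1} = (I - A R_n)^3, the error cubes each step, so convergence is very fast once the residual is below 1.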
Accomplishments: DEFG Framework

Fully implemented
  Consists of approximately 5,000 code lines
  7 different applications and 15+ diagnostic programs
  Complete User's Guide
  Packaged for general use
Design patterns
  10+ patterns
  Patterns range from simple to complex
Delineation of DEFG limits
  Explanation for the success of the DSL and design patterns
DEFG's Performance

DEFG papers
  Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'13) [2]
  Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'14) [3]
Analysis
  Three existing OpenCL applications converted to DEFG
    CPU side re-coded in DEFG, using the existing GPU kernels
  Three applications:
    Breadth-First Search (BFS)
    All-Pairs Shortest Path (APSP/FW)
    Sobel Image Filter (SOBEL)
  Output results carefully verified
Comparisons between "reference" and DEFG
  Lines-of-Code Comparison
  Run-Time Performance Comparison
Lines-of-Code Comparison

         DEFG     DEFG
         Source   Gen.    Ref.
SOBEL    12       467     442
BFS      42       620     364
FW       12       481     478

On average, the DEFG code is 5.6 percent of the reference code.
Run-Time Performance Comparison

Shown are original run times (average of 10 runs)
Later, made manual changes to understand timing differences
  FW reference version was slow due to a vendor coding error
  Likely, CPU-based BFS-4096 was fast due to the CPU's cache
Summary: DEFG provided equal, or better, performance
New Applications

Application implementations
  Filtering, BFS, Roughly Sorting, Iterative Inversion
Implementation goals
  Show general applicability of DEFG
  Multiple-GPU: Filtering, BFS, and R. Sorting
  Novel algorithm: R. Sorting and Iterative Inversion
  Proof of concept: Iterative Inversion (BLAS usage)
Application performance results
  Run-time performance
  Single-GPU and multiple-GPU configurations
  Problem-size characteristics
Vast majority of tests run on the Hydra server
Image Filtering Results

Image applications implementation
  Both Sobel operator and median filter:
    Overlapped-Split-Process-Concatenate design pattern
    Single and multiple-GPU versions
Analysis with large images and multiple GPUs
Image neighborhoods
  Sobel operator: 3x3 grid
  Median filter:
    3x3 grid: less computationally intense
    5x5 grid: more computationally intense
Sobel Operator Application Results

Single-GPU version refactored for multiple-GPU use
  Used the existing OpenCL kernel
  Three simple DEFG code changes needed
New version used two GPUs
  50% of the image, plus a small overlap, given to each
  Produced the same resultant image
Run-time performance was not impressive
  Not sufficiently computationally intense
  OpenCL transfer times went up
  Kernel execution times stayed the same
Median Filter Application Results

CPU-side DEFG code very similar to Sobel
Developed two new OpenCL kernels
  3x3 grid kernel
  5x5 grid kernel
Performance with 3x3 grid similar to Sobel
Performance with multiple GPUs, 5x5 median:
  Run-time improvement with all test images
  With a large image: 1.062 seconds (1 GPU) down to 0.794 (2 GPUs)
  Speedup: 1.34 with 2 GPUs, with a 7k x 7k image
  Also, the 2-GPU median filter handled larger images (22k x 22k)
Performance analysis with 2 GPUs
  Kernel execution run times dropped
  CPU-to-GPU OpenCL transfer times increased
  pthreads experiment showed the need for O.S. threads
Breadth-First Search Results

Breadth-First Search (BFS) summary
  DEFG generalization of the Merrill approach
  Prefix-scan-based buffer allocation:
    "Virtual pointers" to nodes between GPUs
  Used the Harish sparse data structure approach
Analysis of the BFS application
  Characteristics
  Capabilities
  Run-time performance
Multiple-GPU BFS Implementation

BFSDP2GPU DEFG application
  Re-factoring of the previously-ported DEFG BFS application
  DEFG implementation of a complex OpenCL application
    Management of shared buffers with prefix sum
    Run-time communications between GPUs
Tested the application against VLI graphs
  Test graphs from the SNAP and DIMACS repositories
    Stanford Network Analysis Package (SNAP) [8]
    Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [9]
  Very large graph datasets: millions of vertices and edges
BFSDP2GPU Results

Analysis of the BFSDP2GPU application
  Characteristics
    Kernel count went from 2 kernels to 6
    Used 2 GPUs
  Capabilities and run-time performance
    Single-card versus multi-card performance
    Performance relative to the existing BFS application
Application results with VLI graphs
  Processed large graphs (4.8M nodes, 69M edges)
  However, unimpressive run-time performance
    Run times increased by factors of 6 to 17
  Issue: OpenCL's lack of GPU-to-GPU communications
  Lesser issue: mixing of sparse and dense data structures
Roughly Sorting

RSORT application implementation
  Divide-Process-Merge pattern utilized
  Implementation contains five kernels: LRmax, RLmin, DM, UB, and comb_sort
Sort selected for the OpenCL kernel: comb sort
  Sort-in-place design
  Non-recursive
  Similar to bubble sort, but much faster
  Elements are compared a gap apart
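Comb sort's traits listed above (in-place, non-recursive, gapped bubble-sort exchanges) can be shown in a short CPU sketch (Python; the actual kernel is OpenCL C):

```python
def comb_sort(a):
    """In-place, non-recursive comb sort: bubble-sort-style swaps, but the compared
    elements start a large gap apart and the gap shrinks each pass."""
    gap = len(a)
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / 1.3))     # 1.3 is the commonly used shrink factor
        swapped = False
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:        # compare elements a "gap" apart
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```

The large early gaps move far-out-of-place elements ("turtles") quickly, which is what makes comb sort much faster than plain bubble sort while keeping the same simple, branch-light inner loop that suits a GPU kernel.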
RSORT Results

Run-time comparisons using large datasets
  Generated with a set k value and dataset size
  Fully perturbed dataset, or singly perturbed
Example with a k value of 4, 16 perturbed items:
  5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16
Performance analysis over 3 configurations
  QSORT on the CPU, used as the baseline
  RSORT with 1 GPU
  RSORT with 2 GPUs
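One way to build such fully perturbed test data is to reverse consecutive blocks of k+1 values; the slide's k = 4 example follows exactly this pattern (a hypothetical generator for illustration, not necessarily the dissertation's data generator):

```python
def k_perturbed(n, k):
    """Generate 1..n with consecutive blocks of k+1 values reversed, so every
    element lands at most k positions from its sorted spot (a k-sorted sequence)."""
    data = list(range(1, n + 1))
    block = k + 1
    for start in range(0, n - block + 1, block):   # reverse full blocks only
        data[start:start + block] = list(reversed(data[start:start + block]))
    return data
```

For n = 16, k = 4 this reproduces the sequence shown above, and no element is displaced by more than 4 positions.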
RSORT Results Summary

Application implemented in 1-GPU and 2-GPU forms
RSORT run times generally faster when k is small
At k = 1000, the 2-GPU to 1-GPU speedup is 1.73
2-GPU RSORT handles larger datasets than 1-GPU
Iterative Matrix Inversion

IMIFLX application implementation
  Used the DEFG blas statement to access clMath functions
  Multiple-Kernel loop pattern used
  Multiple blas statements per iteration
  Blend of blas statements and kernels
  Anytime used to end iterations at a time limit
Analysis of the application
  Inversion accuracy
  Range of matrices: size and type
  Used data from the University of Florida Sparse Matrix Collection [10]
Application Sample Result

IMIFLX uses 13 iterations for this 500 x 500 matrix
Norm value: || (A * R_n) - I ||
[Graph omitted: shows convergence to a solution]
Run time was 0.259 seconds
Iterative Matrix Inversion Results

Required BLAS support in DEFG
Anytime support: traded accuracy for less run time
Hydra's NVIDIA T20 GPU
  Available RAM: 2,678 MB
  Limits a double-precision matrix to just over 8,000 by 8,000

Name       Type        Size        Iterations   Seconds
H2         Hilbert     2x2         4            0.018
H12        Hilbert     12x12       70           0.089
M500       Generated   500x500     13           0.259
M500any    Generated   500x500     10           0.206
M8000      Generated   8000x8000   17           1380.320
M8500      Generated   8500x8500   n.a.         overflow
1138_bus   Repository  1138x1138   14           3.262
Kuu        Repository  7102x7102   9            605.310
Dissertation Accomplishments

Designed, implemented, and tested DEFG
Produced DEFG "enabling" design patterns
Compared DEFG applications to hand-written versions
  DEFG applications required less code
  DEFG applications produced equal run times
Applied DEFG to diverse GPU applications
  Four diverse applications fully implemented
  Impressive application run-time results, with the exception of BFS, due to an OpenCL limit
Future Research

Additional DEFG design patterns
  Multiple-GPU load balancing
  Resource sharing
Suggest DEFG support for NVIDIA's CUDA
Suggest a re-factored DEFG
  Internal DSL
  More-standard programming environment
  Enables support for more environments
Not optimistic about a declarative approach for the GPU side
Potential for other technical improvements
DEFG's Success

DEFG is a DSL focused on HPC with GPUs
Note the Farber suggestions for GPU performance:
  "Get the data on the GPU and leave it"
  "Give the GPU enough work to do"
  "Focus on data reuse to avoid memory limits"
The CPU becomes the orchestrator
DEFG provides the CPU code to orchestrate
  Declarations to describe the data
  Design patterns to describe the orchestration
  Optimization to minimize data transfers
References

[1] Farber, R. CUDA Application Design and Development. Elsevier, 2011.
[2] Senser, R. and Altman, T. "DEF-G: Declarative Framework for GPUs." Proceedings of the 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (2013): 490-496.
[3] Senser, R. and Altman, T. "A Second Generation of DEFG: Declarative Framework for GPUs." Proceedings of the 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (to be published November 2014).
[4] Harish, P. and Narayanan, P. "Accelerating Large Graph Algorithms on the GPU Using CUDA." High Performance Computing - HiPC 2007. Springer Berlin Heidelberg, 2007. 197-208.
[5] Merrill, D. and Grimshaw, A. "Revisiting Sorting for GPGPU Stream Architectures." Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.
[6] Altman, T. and Igarashi, Y. "Roughly Sorting: Sequential and Parallel Approach." Journal of Information Processing 12.2 (1989): 154-158.
[7] Altman, M. "An Optimum Cubically Convergent Iterative Method of Inverting a Linear Bounded Operator in Hilbert Space." Pacific Journal of Mathematics 10.4 (1960): 297-300.
[8] SNAP: http://snap.stanford.edu/data
[9] DIMACS: http://www.dis.uniroma1.it/challenge9/download.shtml
[10] University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparce/matrices
Additional Slides
Raw Performance Numbers for Three Applications, in Milliseconds

             CPU              GPU (Tesla T20)
             DEF-G   Ref.     DEF-G   Ref.
BFS-4096     1.5     2.6      4.3     5.8
BFS-65536    12.3    14.2     8.0     11.3
FW           111.8   152.0    6.0     51.2
SOBEL        23.0    24.8     3.7     4.1
Sample DEFG Code Showing a Sequence

declare application floydwarshall
declare integer NODE_CNT (0)
declare integer BUF_SIZE (0)
declare gpu gpuone ( any )
declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
declare integer buffer buffer1 ( BUF_SIZE )
        integer buffer buffer2 ( BUF_SIZE )
call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
sequence NODE_CNT times
  execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
end
Sample DEFG Code Showing a Loop-While

declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
        kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
        integer buffer graph_edges ( EDGE_CNT )
        integer buffer graph_mask ( NODE_CNT )
        integer buffer updating_graph_mask ( $NODE_CNT )
        integer buffer graph_visited ( NODE_CNT )
        integer buffer cost ( NODE_CNT )
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out) updating_graph_mask(out) graph_visited(out) cost(out) NODE_CNT(out) EDGE_CNT(out))
loop
  execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in) updating_graph_mask(out) graph_visited(in) cost(inout) $NODE_CNT(in) )
  // set STOP to zero each time thru...
  set STOP (0)
  // note: STOP value is returned...
  execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout) graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end
RSORT Data
[data charts omitted]

IMIFLX Data
[data charts omitted]

DEFG 4-Way Mini-Experiment SpeedUp
[data chart omitted]
Old Slides
DEFG Architecture (old)

[Diagram: DEF-G Input Code → ANTLR-Based DEF-G Parser → XML Document → TinyXML2 & DEF-G Code Generator → OpenCL Code]

The DEFG framework generates the CPU code
  Input: declarative statements
  Uses design patterns
  Output: OpenCL code
DEFG "Translator":
  ANTLR-Based Parser
  Intermediate XML Document
  TinyXML2 Parser
  Code Generator written in C++
Accomplishments ???

DEFG
  Inputs declarations
  Uses declared design patterns
  Generates CPU-side OpenCL code
Summarized the proof-of-concept DEFG
Described the Version 2 DEFG enhancements
Described the diverse DEFG applications
  These show DEFG's applicability and flexibility
  Each is a full application implementation
Addressed DEFG research goals
Presentation Outline

Research Accomplishments
  Designed and implemented DEFG
  Produced DEFG "enabling" design patterns
  Applied DEFG to diverse GPU applications
  Analyzed the performance of these applications
Future Research
Proposed Dissertation Work Plan

DEFG Enhancements (design patterns)
  Three existing design patterns
    Execute kernel once, execute N times, and loop-while
    The current loop-while syntax is too "procedural"
  New design patterns
    Anytime algorithm support
    Multiple-GPU support
      Divide, process, merge
      Overlapped split, process, concatenate
    Explicit Parallel
  Add other interesting DEF-G design patterns
DEFG Enhancements (implementation)

((get list from doc))
Image Filtering Results

Developed a proof-of-concept DEFG version
Created a proof-of-concept, hand-written, multiple-GPU version
  Using two cards doubled throughput
  Complexity in managing overlapped sub-images
Next steps: DEFG multiple-GPU version
  Due to the 3x3 mask, sub-images overlap
  Uses the overlapped split, process, concatenate design pattern
  Testing and analysis with large images and more than one GPU
  Expect to find (or emulate) a four-GPU environment
Accomplishments

Declarative Approach to ...
Design Patterns
DEFG Tools
  Produced DEFG parser, optimizer, and code generator
  Diagnostics ...
DEFG Applications
  Performance verification
  Image Filters
  Multiple-GPU graph processing
  Rough Sorting
  Iterative Matrix Inversion
Proposed New DEFG Features

Additional DEFG Design Patterns
  Multiple-GPU Load Balancing
  Resource sharing
DEFG Support for NVIDIA's CUDA
Re-factored DEFG (DEFG Version 3)
  Internal DSL
  More-standard programming environment
  Could support more environments
Roughly Sorting Implementation

RSORT implementation contained five kernels
  Comb sort
Analysis and complications
  Choose an existing OpenCL sort for use
  Add support for multiple GPU cards using DEFG's divide, process, merge pattern
  Performance analysis of GPU Roughly Sorting