Accelerating MATLAB Image Processing Toolbox Functions on GPUs

Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor, Huiyang Zhou
Motivation
With high memory bandwidth and teraflops computing capability, Graphics Processing Units (GPUs) have become quite attractive for accelerating general-purpose applications
Developing high-performance GPU programs, however, requires a deep understanding of both application algorithms and GPU hardware architecture
A systematic way of dealing with a generic class of applications is missing
University of Central Florida
Our Contributions
Compare performance-critical hardware features in different GPUs
Develop high-quality open-source library code for some representative functions in MATLAB™ Image Processing Toolbox (IPT)
https://sites.google.com/site/iptatiproject/ [15]
Reveal insights on efficiently accelerating a wide range of image processing algorithms
Presentation Outline
Motivation
Our Contributions
Implication of GPU hardware on GPGPU programming
A GPGPU library for IPT functions: categorization and optimization strategies
Case studies: 2D convolution, dither
Conclusions
Implication of GPU hardware on GPGPU programming

Performance-critical hardware features and their implications on GPGPU programs:

Memory access bandwidth
  AMD/ATI HD5870 (RV870): vector-type (float2 or float4) data access
  NVIDIA GTX280: scalar-type (float) or vector-type (float2) data access

Register file
  HD5870: a high number of registers (1k float4 registers, i.e., 16KB, per core) implies more computational work in each core
  GTX280: a relatively small number of registers (2k float registers, i.e., 8KB, per core) implies less computational work in each core

Shared memory / local data share
  HD5870: a larger shared memory (32KB per SIMD engine) implies larger tile sizes (one tile is the workload of one work group)
  GTX280: a smaller shared memory (16KB per SM) implies smaller tile sizes (one tile is the workload of one thread block)

Ratio of peak computation throughput to peak memory bandwidth
  HD5870: (2.72 TFLOPS)/(154 GB/s) means more computation needs to be performed for each loaded data item
  GTX280: (0.62 TFLOPS)/(141 GB/s) means relatively little computation needs to be performed for each loaded data item

Stream processor pipeline
  HD5870: 5-way VLIW; ILP needed to keep the ALUs busy
  GTX280: scalar pipeline; less ILP needed
Memory access bandwidth: our experiments show that, compared with float2 data accesses, using float instead would reduce bandwidth by at least 10%, and using float4 instead would reduce bandwidth by at least 16%.
Register file totals: HD5870: 256KB per SIMD engine x 20 SIMD engines = 5MB; GTX280: 64KB per SM x 30 SMs = 1.875MB.
Shared memory / local data share totals: HD5870: 32KB per SIMD engine x 20 SIMD engines = 640KB; GTX280: 16KB per SM x 30 SMs = 480KB.
Ratio of peak computation throughput to peak memory bandwidth:
HD5870: (2720 GFLOPS)/(154 GB/s) = 17.7 FLOP/B = 70.6 FLOP/word
GTX280: (624 GFLOPS)/(141 GB/s) = 4.4 FLOP/B = 17.7 FLOP/word
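These ratios can be checked directly from the peak numbers (assuming a 4-byte float word):

```python
def flops_per_byte(peak_gflops, peak_gbps):
    """Peak compute / peak bandwidth ratio in FLOP per byte."""
    return peak_gflops / peak_gbps

# Numbers from the table: HD5870 (2720 GFLOPS, 154 GB/s), GTX280 (624 GFLOPS, 141 GB/s).
hd5870 = flops_per_byte(2720, 154)
gtx280 = flops_per_byte(624, 141)

print(round(hd5870, 1), round(hd5870 * 4, 1))  # 17.7 FLOP/B, 70.6 FLOP/word
print(round(gtx280, 1), round(gtx280 * 4, 1))  # 4.4 FLOP/B, 17.7 FLOP/word
```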
Stream processor pipeline: HD5870: 5-way VLIW; GTX280: scalar pipeline.
Summary of the Library
MATLAB Image Processing Toolbox (IPT) Function Classification

(A) Data independent
  intlut: convert integer values using a lookup table
  imadjust: adjust image intensity values
  imlincomb: linear combination of images
(B) Data sharing
  edge: find edges in a grayscale image
  imregionalmax: regional maxima of an image
  ordfilt2: 2-D order-statistic filtering
  conv2: 2-D convolution of an image
  mean2: average of matrix elements
  imdilate/imerode: dilate/erode a grayscale image
(C) Algorithm dependent
  bwdist: Euclidean distance transform of a binary image
  radon: Radon transform
(D) Data dependent
  dither: represent grayscale images in binary format
MATLAB IPT Function Classification and
Optimization Strategies
Data independent
Characteristics: straightforward one-to-one mapping, abundant parallelism
Strategies: effectively utilize bandwidth by packing multiple pixels; if possible, perform multiple such lightweight tasks together to amortize the CPU-GPU data transfer overhead
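As a concrete instance of this category, intlut maps each pixel independently through a lookup table; a minimal scalar reference in Python (the GPU version would additionally pack pixels for bandwidth; the function name here is illustrative, not the library's API):

```python
def intlut_ref(image, lut):
    """Scalar reference for intlut-style processing: replace each integer
    pixel value with the corresponding lookup-table entry. Every output
    pixel depends only on its own input pixel, so all pixels can be
    processed in parallel."""
    return [[lut[p] for p in row] for row in image]

# Example: invert an 8-bit image via a 256-entry table.
invert = [255 - v for v in range(256)]
print(intlut_ref([[0, 128, 255]], invert))  # [[255, 127, 0]]
```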
Data sharing
Characteristics: still a one-to-one mapping, but the input pixels used to compute adjacent output pixels overlap
Strategies: data reuse, computation reuse
Algorithm dependent
Characteristics: lack of explicit parallelism
Strategies: rethink the algorithms, explore inherent parallelism
Data dependent
Characteristics: lack of explicit parallelism; sequential nature with data dependencies and fine-grained communication requirements
Strategies: give it a shot, you might be surprised
Summary of the Library
Performance Comparison against MATLAB CPU (single-threaded)

Kernel speedup:

Function category        Function name       GTX 280   GTX 280   HD5870
                                             CUDA      OpenCL    OpenCL
(A) Data independent     intlut              17.7      17.5      12.7
                         imadjust            21.4      15.7      11.9
                         imlincomb           944.6     593.7     1385.4
(B) Data sharing         edge                3385.9    1175.2    4955.1
                         imregionalmax       2117.8    798.4     3694.0
                         ordfilt2            1199.6    171.6     1727.1
                         conv2               345.5     156.9     649.8
                         mean2               50.5      25.2      34.7
                         imdilate/imerode    951.5     523.3     1579.8
(C) Algorithm dependent  bwdist              134.8     126.2     104.3
                         radon               84.3      67.4      61.2
(D) Data dependent       dither              10.2      6.5       7.6
Geometric mean kernel speedup: 206x (CUDA on GTX 280), 110x (OpenCL on GTX 280), 218x (OpenCL on HD 5870).
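The reported geometric means can be reproduced from the per-function speedups in the table (a quick check in Python):

```python
import math

def geometric_mean(xs):
    """Geometric mean via the mean of logs (safer than multiplying 12 large values)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Per-function kernel speedups from the table, in table order.
cuda_gtx280   = [17.7, 21.4, 944.6, 3385.9, 2117.8, 1199.6, 345.5, 50.5, 951.5, 134.8, 84.3, 10.2]
opencl_gtx280 = [17.5, 15.7, 593.7, 1175.2, 798.4, 171.6, 156.9, 25.2, 523.3, 126.2, 67.4, 6.5]
opencl_hd5870 = [12.7, 11.9, 1385.4, 4955.1, 3694.0, 1727.1, 649.8, 34.7, 1579.8, 104.3, 61.2, 7.6]

print(round(geometric_mean(cuda_gtx280)))    # 206
print(round(geometric_mean(opencl_gtx280)))  # 110
print(round(geometric_mean(opencl_hd5870)))  # 218
```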
Functions for which OpenCL on the HD 5870 achieves a higher kernel speedup than CUDA on the GTX 280:
  imlincomb: 944.6 vs. 1385.4
  edge: 3385.9 vs. 4955.1
  imregionalmax: 2117.8 vs. 3694.0
  ordfilt2: 1199.6 vs. 1727.1
  conv2: 345.5 vs. 649.8
  imdilate: 951.5 vs. 1579.8
On the GTX 280, the geometric mean kernel speedup is 206x with CUDA versus 110x with OpenCL.
2D Convolution Overview

(Figure: a 3 x 3 filter with weights [1 1 1; 1 1 1; 2 2 1] applied to the input pixels [1 2 4; 5 9 6; 3 7 8] produces the output pixel 1+2+4+5+9+6+6+14+8 = 55.)
2D Convolution Overview

Slide the filter over each pixel of the source image, then multiply and accumulate the overlapped input elements to generate an output pixel.
(Figure: the filter positioned over one pixel of the input image.)
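The multiply-accumulate described above can be written as a scalar reference, a plain Python sketch of what each GPU thread computes rather than the library's kernel; zero padding at the borders and correlation-style filtering (no kernel flip) are assumptions:

```python
def filter2d(image, kernel):
    """Naive 2-D multiply-accumulate filtering with zero-padded borders.

    `image` and `kernel` are lists of lists of numbers; the kernel is
    assumed to have odd dimensions so it can be centered on each pixel.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    ry, rx = kh // 2, kw // 2
    out = [[0] * iw for _ in range(ih)]
    for y in range(ih):
        for x in range(iw):
            acc = 0
            # Accumulate products of the kernel with the overlapped pixels.
            for dy in range(-ry, ry + 1):
                for dx in range(-rx, rx + 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy < ih and 0 <= sx < iw:
                        acc += image[sy][sx] * kernel[dy + ry][dx + rx]
            out[y][x] = acc
    return out
```

For the example figure, `filter2d([[1, 2, 4], [5, 9, 6], [3, 7, 8]], [[1, 1, 1], [1, 1, 1], [2, 2, 1]])[1][1]` gives 55.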
2D Convolution: Intra-Thread Data Reuse

Each thread computes multiple output pixels along a column.
Intra-thread reuse: for a 7 x 7 filter, each input pixel is reused up to 7 times.
(Figure: thread i sliding down a column of the input image.)
2D Convolution: Inter-Thread Data Reuse

Threads in the same warp/wavefront access the same row.
Inter-thread reuse: the row is fetched into the texture cache/shared memory and reused by different threads on subsequent accesses.
(Figure: threads 0-3 reusing a row of the input image held in the texture cache/shared memory.)
2D Convolution Performance

For a 4096 x 4096 image with a 7 x 7 filter:
Jacket: around 20 GFLOPS on GTX 280 (Jacket 1.2.2 trial version, released 1/4/2010, from Accelereyes®)
Ours: around 350 GFLOPS on GTX 280, around 733 GFLOPS on HD 5870
Data Dependent Case Study: Dither
Dither

(Figure: an input pixel with value 230 is compared against the threshold 128; since 230 is not less than 128, the output pixel is 1, and the error 230 - 128 = 102 is diffused to neighboring input pixels.)
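The threshold-and-diffuse step above is the core of error diffusion; a serial Floyd-Steinberg sketch in Python. The exact weights (7/16, 3/16, 5/16, 1/16) and measuring the error against the quantized output level are assumptions about the standard formulation, which may differ in detail from IPT's dither:

```python
def dither_fs(image):
    """Serial Floyd-Steinberg error diffusion: threshold each pixel to
    0 or 1 and push the quantization error onto not-yet-visited
    neighbors (right, lower-left, lower, lower-right)."""
    h, w = len(image), len(image[0])
    px = [list(map(float, row)) for row in image]  # working copy
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = px[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = 1 if new else 0
            err = old - new
            for dy, dx, wgt in ((0, 1, 7/16), (1, -1, 3/16), (1, 0, 5/16), (1, 1, 1/16)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    px[ny][nx] += err * wgt
    return out
```

The data dependency is visible here: the value at (y, x) is not final until its left and upper neighbors have been processed, which is what makes the algorithm hard to parallelize.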
Dither – Data Dependency

(Figure: the pixel at (i, j) depends on already-processed neighboring pixels; the diagram is indexed by row i, column j, and diagonal i + j.)
Dither – Parallel Processing Schedule

(Figure: the time step at which each pixel can be processed in parallel, following the schedule of P. Metaxas [8].)
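The schedule can be derived mechanically from the dependency structure: pixel (i, j) must wait for its left neighbor (i, j-1) and for (i-1, j-1), (i-1, j), (i-1, j+1) in the row above. That producer set is my reading of the error-diffusion dependencies; the resulting 2i + j + 1 step pattern matches the figure's schedule:

```python
def schedule(h, w):
    """Earliest time step at which each pixel of an h x w image can be
    processed, given that (i, j) depends on (i, j-1) and on
    (i-1, j-1), (i-1, j), (i-1, j+1)."""
    step = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            deps = [(i, j - 1), (i - 1, j - 1), (i - 1, j), (i - 1, j + 1)]
            step[i][j] = 1 + max((step[a][b] for a, b in deps
                                  if 0 <= a < h and 0 <= b < w), default=0)
    return step
```

Because each row starts two steps after the row above it, only about min(h, w/2) pixels are ready at any one step, which explains the low utilization discussed next.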
Dither – Our GPU Implementation

(Figure: time steps assigned to pixel blocks across the image.)
A relatively small number of thread blocks/threads is active at any given time:
  low resource utilization
  synchronization overhead (among thread blocks/threads)
We still get up to 10.3x kernel speedup and 3.5x overall speedup!
Conclusions
We identify performance-critical hardware features for GPGPU programs
We present our experience and optimization strategies in developing high performance GPU code for functions from MATLAB Image Processing Toolbox
Our Open-source Library Project Website
https://sites.google.com/site/iptatiproject/
[15]
You are more than welcome to contribute!
Thank you and Questions?