Accelerating MATLAB Image Processing Toolbox Functions on GPUs

Presentation Transcript

Slide1

Accelerating MATLAB Image Processing Toolbox Functions on GPUs

Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor, Huiyang Zhou

Slide2

Motivation

With high memory bandwidth and teraflops of computing capability, Graphics Processing Units (GPUs) have become quite attractive for accelerating general-purpose applications.

Developing high-performance GPU programs, however, requires a deep understanding of both the application algorithms and the GPU hardware architecture.

A systematic way of dealing with a generic class of applications is still missing.

Slide3

Our Contributions

Compare performance-critical hardware features across different GPUs.

Develop high-quality open-source library code for representative functions in the MATLAB™ Image Processing Toolbox (IPT): https://sites.google.com/site/iptatiproject/ [15]

Reveal insights on efficiently accelerating a wide range of image processing algorithms.

Slide4

Presentation Outline

Motivation
Our Contributions
Implications of GPU hardware on GPGPU programming
A GPGPU library for IPT functions: categorization and optimization strategies
Case studies: 2D convolution, dither
Conclusions

Slide5

Implications of GPU hardware on GPGPU programming

| Performance-Critical Hardware Feature | Implication for GPGPU programs on AMD/ATI HD5870 (RV870) | Implication for GPGPU programs on NVIDIA GTX280 |
|---|---|---|
| Memory access bandwidth | Vector-type (float2 or float4) data access | Scalar-type (float) or vector-type (float2) data access |
| Register file | A high number of registers (1k float4 registers, i.e., 16 kB, per core) implies more computational work in each core | A relatively small number of registers (2k float registers, i.e., 8 kB, per core) implies less computational work in each core |
| Shared memory / local data share | A larger shared memory (32 kB per SIMD engine) implies large tile sizes (one tile is the workload of one thread block / work group) | A smaller shared memory (16 kB per SM) implies small tile sizes |
| Ratio of peak computation throughput to peak memory bandwidth | (2.72 TFLOPS) / (154 GB/s) means more computation needs to be performed for each loaded data item | (0.62 TFLOPS) / (141 GB/s) means a relatively small amount of computation needs to be performed for each loaded data item |
| Stream processor pipeline | 5-way VLIW; ILP is needed to keep the ALUs busy | Scalar pipeline; less ILP is needed |

Slide6

Implications of GPU hardware on GPGPU programming: memory access bandwidth

Our experiments show:
- using float instead would reduce bandwidth by at least 10%

- using float4 instead would reduce bandwidth by at least 16%

Slide7

Implications of GPU hardware on GPGPU programming: register file

Total register-file capacity:
- HD5870: 256 KB per SIMD engine × 20 SIMD engines = 5 MB in total
- GTX280: 64 KB per SM × 30 SMs = 1.875 MB in total

Slide8

Implications of GPU hardware on GPGPU programming: shared memory / local data share

Total on-chip shared-memory capacity:
- HD5870: 32 KB per SIMD engine × 20 SIMD engines = 640 KB in total
- GTX280: 16 KB per SM × 30 SMs = 480 KB in total

Slide9

Implications of GPU hardware on GPGPU programming: ratio of peak computation throughput to peak memory bandwidth

- HD5870: (2720 GFLOPS) / (154 GB/s) = 17.7 flop/B = 70.6 flop/word
- GTX280: (624 GFLOPS) / (141 GB/s) = 4.4 flop/B = 17.7 flop/word

(A word here is 4 bytes, i.e., one float.)

Slide10

Implications of GPU hardware on GPGPU programming: stream processor pipeline

- HD5870: 5-way VLIW pipeline; instruction-level parallelism (ILP) is needed to keep the ALUs busy
- GTX280: scalar pipeline; less ILP is needed

Slide11

Summary of the Library: MATLAB Image Processing Toolbox (IPT) Function Classification

| Function Category | Function Name | Function Description |
|---|---|---|
| (A) Data independent | intlut | Convert integer values using a lookup table |
| | imadjust | Adjust image intensity values |
| | imlincomb | Linear combination of images |
| (B) Data sharing | edge | Find edges in a grayscale image |
| | imregionalmax | Regional maxima of an image |
| | ordfilt2 | 2-D order-statistic filtering |
| | conv2 | 2-D convolution of an image |
| | mean2 | Average of matrix elements |
| | imdilate / imerode | Dilate/erode a grayscale image |
| (C) Algorithm dependent | bwdist | Euclidean distance transform of a binary image |
| | radon | Radon transform |
| (D) Data dependent | dither | Represent grayscale images in binary format |

Slide12

MATLAB IPT Function Classification and Optimization Strategies: Data Independent

Characteristics: straightforward one-to-one mapping, abundant parallelism.

Strategies: effectively utilize bandwidth by packing multiple pixels; perform multiple such lightweight tasks where possible to amortize the CPU-GPU data transfer overhead (see the sketch below).

Slide13

MATLAB IPT Function Classification and Optimization Strategies: Data Sharing

Characteristics: still a one-to-one mapping, but the input pixels used to compute adjacent output pixels overlap.

Strategies: data reuse, computation reuse.

Slide14

MATLAB IPT Function Classification and Optimization Strategies: Algorithm Dependent

Characteristics: lack of explicit parallelism.

Strategies: rethink the algorithms; explore the inherent parallelism.

Slide15

MATLAB IPT Function Classification and Optimization Strategies: Data Dependent

Characteristics: lack of explicit parallelism; sequential in nature, with data dependencies and fine-grain communication requirements.

Strategies: give it a shot anyway; you might be surprised (see the dither case study).

Slide16

Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Kernel speedups:

| Function Category | Function Name | GTX 280 (CUDA) | GTX 280 (OpenCL) | HD5870 (OpenCL) |
|---|---|---|---|---|
| (A) Data independent | intlut | 17.7 | 17.5 | 12.7 |
| | imadjust | 21.4 | 15.7 | 11.9 |
| | imlincomb | 944.6 | 593.7 | 1385.4 |
| (B) Data sharing | edge | 3385.9 | 1175.2 | 4955.1 |
| | imregionalmax | 2117.8 | 798.4 | 3694.0 |
| | ordfilt2 | 1199.6 | 171.6 | 1727.1 |
| | conv2 | 345.5 | 156.9 | 649.8 |
| | mean2 | 50.5 | 25.2 | 34.7 |
| | imdilate / imerode | 951.5 | 523.3 | 1579.8 |
| (C) Algorithm dependent | bwdist | 134.8 | 126.2 | 104.3 |
| | radon | 84.3 | 67.4 | 61.2 |
| (D) Data dependent | dither | 10.2 | 6.5 | 7.6 |

Slide17

Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Geometric-mean kernel speedups across all functions:

| | GTX 280 (CUDA) | GTX 280 (OpenCL) | HD5870 (OpenCL) |
|---|---|---|---|
| Geometric mean | 206x | 110x | 218x |

Slide18

Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Functions for which the HD 5870 (OpenCL) outperforms the GTX 280 (CUDA):

| Function Name | CUDA on GTX 280 | OpenCL on HD 5870 |
|---|---|---|
| imlincomb | 944.6 | 1385.4 |
| edge | 3385.9 | 4955.1 |
| imregionalmax | 2117.8 | 3694.0 |
| ordfilt2 | 1199.6 | 1727.1 |
| conv2 | 345.5 | 649.8 |
| imdilate | 951.5 | 1579.8 |

Slide19

Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Kernel speedup on the GTX 280, CUDA versus OpenCL (geometric mean):

| | CUDA | OpenCL |
|---|---|---|
| Geometric mean | 206x | 110x |

Slide20

2D Convolution Overview

[Figure: a 3 x 3 filter applied to a 3 x 3 patch of input pixels produces one output pixel. With the input patch [1 2 4; 5 9 6; 3 7 8] and the filter [1 1 1; 1 1 1; 2 2 1] shown on the slide, the multiply-accumulate yields 1+2+4+5+9+6+2*3+2*7+8 = 55.]

Slide21

2D Convolution Overview

Slide the filter over each pixel of the source image, multiplying the overlapped input elements by the filter weights and accumulating the products to generate each output pixel.

[Figure: the filter window positioned over the input image, centered on the pixel being computed.]

Slide22

2D Convolution: Intra-Thread Data Reuse

Each thread computes multiple output pixels along a column.

Intra-thread reuse: for a 7 x 7 filter, each input pixel a thread loads is reused up to 7 times.

[Figure: thread i moving down one column of the input image, its successive filter windows overlapping.]

Slide23

2D Convolution: Inter-Thread Data Reuse

Threads in the same warp/wavefront access the same row of the input image.

Inter-thread reuse: the row is fetched into the texture cache / shared memory once and reused by different threads on subsequent accesses.

[Figure: threads 0-3 reading overlapping windows of one input row staged in the texture cache / shared memory.]

Slide24

2D Convolution Performance

For a 4096 x 4096 image with a 7 x 7 filter:
- Jacket: around 20 GFLOPS on the GTX 280 (Jacket 1.2.2 trial version, released 1/4/2010, from AccelerEyes®)
- Ours: around 350 GFLOPS on the GTX 280 and around 733 GFLOPS on the HD 5870

Slide25

Data Dependent Case Study: Dither

Slide26

Dither

[Figure: error diffusion for one pixel. An input pixel with value 230 is thresholded against 128; since 230 >= 128, the output pixel is 1, and the quantization error, 230 - 128 = 102, is propagated to neighboring input pixels.]

Slide27

Dither: Data Dependency

[Figure: the pixel at (i, j) can only be processed after the neighbors it receives error from (to its left and in the row above) have been processed; the fronts of ready pixels are indexed by i + j.]

Slide28

Dither: Parallel Processing Schedule

[Figure (from P. Metaxas [8]): a wavefront schedule over the image in which pixel (i, j) is processed at time step 2i + j + 1, so each row starts two steps after the row above it (row starts 1, 3, 5, 7, ...), and all pixels that share a step are processed in parallel.]

Slide29

Dither: Our GPU Implementation

[Figure: our schedule processes pixel (i, j) at time step 3i + j + 1, so each row starts three steps after the row above it (row starts 1, 4, 7, 10, ...).]

A relatively small number of thread blocks/threads are active at any given time:
- low resource utilization
- synchronization overhead (among thread blocks/threads)

We still get up to a 10.3x kernel speedup and a 3.5x overall speedup! A hedged sketch of one wavefront step follows.

Slide30

Conclusions

We identify the performance-critical hardware features for GPGPU programs.

We present our experience and optimization strategies in developing high-performance GPU code for functions from the MATLAB Image Processing Toolbox.

Slide31

Our Open-Source Library Project Website

https://sites.google.com/site/iptatiproject/ [15]

You are more than welcome to contribute!

Thank you. Questions?