GPU-Efficient Recursive Filtering and Summed-Area Tables - PowerPoint Presentation

Uploaded On 2016-06-23

D. Nehab (IMPA), A. Maximo (IMPA), R. S. Lima (Digitok), H. Hoppe (Microsoft Research)

Presentation Transcript

Slide1

GPU-Efficient Recursive Filtering and Summed-Area Tables

D. Nehab

A. Maximo

R. S. Lima

H. Hoppe

IMPA

Digitok

Microsoft Research

Slide2

Linear, shift-invariant filters

But use feedback from earlier outputs

Recursive filters
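To make the feedback concrete, here is a minimal Python sketch of a first-order causal recursive filter. The sign convention y[i] = x[i] - a*y[i-1], the name causal_filter, and the scalar prologue argument are assumptions chosen for illustration; the slides do not fix them.

```python
def causal_filter(x, a, prologue=0.0):
    """First-order causal recursive filter (assumed convention):
    y[i] = x[i] - a * y[i-1], where y[-1] is supplied by `prologue`."""
    y = []
    prev = prologue  # the output that precedes the first sample
    for xi in x:
        prev = xi - a * prev  # feedback from the previous output
        y.append(prev)
    return y


# Impulse response with a = -0.5: each output is half of the previous one.
impulse_response = causal_filter([1.0, 0.0, 0.0, 0.0], a=-0.5)
print(impulse_response)  # [1.0, 0.5, 0.25, 0.125]
```

Each output depends on the one before it, so a single sequence cannot be split across threads without extra work.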

(Figure: input, output, prologue)

Slide3

Recursive filters

Linear, shift-invariant filters

But use feedback from earlier outputs

Sequential dependency chain

(Figure: output, input, prologue)

Slide4

Applications of recursive filtering

B-Spline (or other) interpolation
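For cubic B-splines, the recursive preprocessing step that turns samples into coefficients is commonly written as one causal and one anticausal first-order pass with pole sqrt(3) - 2. The sketch below follows that textbook formulation, not the presenters' GPU code. It assumes the signal is zero outside its range, and the function names are hypothetical.

```python
import math

Z1 = math.sqrt(3.0) - 2.0  # pole of the cubic B-spline prefilter


def cubic_bspline_coefficients(samples):
    """Return (coefficients, epilogue) for a non-empty list of samples.
    Assumption: the signal is zero outside [0, n-1]."""
    n = len(samples)
    # Causal pass with zero prologue: cplus[k] = s[k] + Z1 * cplus[k-1]
    cplus, prev = [], 0.0
    for s in samples:
        prev = s + Z1 * prev
        cplus.append(prev)
    # Exact value of c[n] under the zero-extension assumption.
    epilogue = -6.0 * Z1 * Z1 / (1.0 - Z1 * Z1) * cplus[-1]
    # Anticausal pass: c[k] = Z1 * (c[k+1] - 6 * cplus[k])
    c, nxt = [0.0] * n, epilogue
    for k in range(n - 1, -1, -1):
        nxt = Z1 * (nxt - 6.0 * cplus[k])
        c[k] = nxt
    return c, epilogue


def interpolate_at_sample(c, k):
    """Evaluate the cubic B-spline at integer position k (interior only)."""
    return (c[k - 1] + 4.0 * c[k] + c[k + 1]) / 6.0


samples = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
coeffs, _ = cubic_bspline_coefficients(samples)
errors = [abs(interpolate_at_sample(coeffs, k) - samples[k])
          for k in range(1, len(samples) - 1)]
```

Interpolating from the coefficients at interior sample positions reproduces the input up to rounding, which is the purpose of the preprocessing step.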

(Figure: input → recursive preprocessing step → coefficients → interpolation from coefficients)

Slide5

Applications of recursive filtering

B-Spline (or other) interpolation

Fast, wide, Gaussian-blur approximation

Summed-area tables

(Figure: input, blurred, recursive filters)

Slide6

Recursive filters can be causal or anticausal

Causal goes forward, anticausal in reverse direction

Filter order is simply the number r of feedbacks

Causality and order
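As a sketch of causality and order, the following Python functions run an order-r filter forward (causal) and backward (anticausal). The convention y[i] = x[i] - sum_k d[k]*y[i-1-k], the coefficient list d, and the prologue and epilogue lists are assumptions made for this example.

```python
def causal(x, d, prologue):
    """Order-r causal pass, r = len(d) feedback coefficients (assumed form):
    y[i] = x[i] - sum_k d[k] * y[i-1-k].
    `prologue` holds y[-r], ..., y[-1] (oldest first)."""
    r = len(d)
    assert len(prologue) == r
    y = list(prologue)
    for xi in x:
        acc = xi
        for k in range(r):
            acc -= d[k] * y[-1 - k]  # feedback from the k-th most recent output
        y.append(acc)
    return y[r:]


def anticausal(y, d, epilogue):
    """Order-r anticausal pass: z[i] = y[i] - sum_k d[k] * z[i+1+k].
    `epilogue` holds z[n], ..., z[n+r-1]. It is the causal pass run on the
    reversed sequence."""
    return causal(y[::-1], d, epilogue[::-1])[::-1]


x = [1.0, 0.0, 0.0, 0.0]
y = causal(x, [-0.5], [0.0])      # forward pass
z = anticausal(y, [-0.5], [0.0])  # backward pass over the causal result
```

Here r = 1, so one prologue value and one epilogue value are enough to start each pass.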

(Figure: input, epilogue, output)

Slide7

Filter sequences and separability

Often, sequences of recursive filters are needed

Independent columns: causal, anticausal

Independent rows: causal, anticausal

Slide8

Algorithm RT

The baseline algorithm

Process columns in parallel, then rows in parallel

Ruijters et al. 2010 “GPU prefilter […]”
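The baseline structure can be sketched in plain Python as a column stage followed by a row stage. This is a sequential emulation under assumptions (first-order filter, zero prologue and epilogue, hypothetical function names), not the CUDA code of Ruijters et al.; on a GPU each column, then each row, would be one independent task.

```python
def filter_1d(x, a):
    """Causal then anticausal first-order pass with zero prologue/epilogue
    (assumed convention: y[i] = x[i] - a * y[i-1])."""
    y, prev = [], 0.0
    for v in x:
        prev = v - a * prev
        y.append(prev)
    z, nxt = [0.0] * len(y), 0.0
    for i in range(len(y) - 1, -1, -1):
        nxt = y[i] - a * nxt
        z[i] = nxt
    return z


def algorithm_rt(image, a):
    """Filter every column, then every row, of a list-of-rows image."""
    h, w = len(image), len(image[0])
    # Column stage: w independent tasks.
    cols = [filter_1d([image[r][c] for r in range(h)], a) for c in range(w)]
    after_cols = [[cols[c][r] for c in range(w)] for r in range(h)]
    # Row stage: h independent tasks.
    return [filter_1d(row, a) for row in after_cols]


result = algorithm_rt([[1.0, 0.0], [0.0, 0.0]], a=-0.5)
```

The number of independent tasks is only the image width in the column stage and the image height in the row stage.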

(Figure: input, stages, output; column processing, row processing)

Slide9

First-order filter benchmarks

Alg. RT is the baseline implementation

Ruijters et al. 2010 “GPU prefilter […]”

(Chart: Cubic B-Spline Interpolation, GeForce GTX 480)

(Table: Alg., Step, Complexity, Max. # of Threads, Used, Bandwidth; row RT)

Slide10

Optimization roadmap

Modern GPUs have several hundred cores

Latency-hiding requires many times more tasks

Images are not large enough: must parallelize further

(Table: Alg., Step, Complexity, Max. # of Threads, Used, Bandwidth; row RT)

Slide11

Increasing parallelism

Similar to parallel prefix-sum algorithms

Sengupta et al. 2007 “Scan primitives for GPU computing”

Dotsenko et al. 2008 “Fast scan algorithms […]”

Compute and store incomplete prologues

Fix incomplete prologues

Somewhat more complicated than a recursive invocation

Use prologues to compute and store causal results
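The three steps (compute incomplete prologues, fix them, then compute causal results) can be sketched for a first-order filter as follows. This is a sequential Python emulation under assumptions: the convention y[i] = x[i] - a*y[i-1], a zero prologue before the first block, and hypothetical function names. The fix uses linearity: a prologue p entering a block of length L contributes (-a)**L * p to that block's last output.

```python
def causal_block(x, a, prologue):
    """First-order causal pass over one block (assumed convention)."""
    y, prev = [], prologue
    for v in x:
        prev = v - a * prev
        y.append(prev)
    return y


def blocked_causal(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1 (independent per block): filter with a zero prologue and keep
    # only the last output, the incomplete prologue for the next block.
    incomplete = [causal_block(blk, a, 0.0)[-1] for blk in blocks]
    # Stage 2 (sequential, one value per block): add the contribution of the
    # true prologue that Stage 1 ignored.
    fixed, prev = [], 0.0
    for blk, pbar in zip(blocks, incomplete):
        prev = pbar + (-a) ** len(blk) * prev
        fixed.append(prev)
    # Stage 3 (independent per block): filter again with corrected prologues.
    out = []
    for m, blk in enumerate(blocks):
        out.extend(causal_block(blk, a, fixed[m - 1] if m > 0 else 0.0))
    return out


x = [float(i) for i in range(1, 11)]
blocked = blocked_causal(x, -0.5, 4)
reference = causal_block(x, -0.5, 0.0)
```

Only Stage 2 is sequential, and it touches one value per block rather than every sample.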

Slide12

Fixing incomplete prologues

(Equation annotations: superposition, linearity)

Slide13

Algorithm 2

Adds block parallelism

Sung et al. 1986 “Efficient […] recursive […]”, or Blelloch 1990 “Prefix sums […]”

+ tricks from GPU parallel scan algorithms

(Figure: input, stages, output; fix, fix)

Slide14

First-order filter benchmarks

Alg. RT is the baseline implementation

Ruijters et al. 2010 “GPU prefilter […]”

Alg. 2 adds block parallelism & tricks

Sung et al. 1986 “Efficient […] recursive […]”

Blelloch 1990 “Prefix sums […]”

+ tricks from GPU parallel scan algorithms

(Chart: Cubic B-Spline Interpolation, GeForce GTX 480)

(Table: Alg., Step, Complexity, Max. # of Threads, Memory, Bandwidth; row RT)

Slide15

Optimization roadmap

Modern GPUs have several hundred cores

Latency-hiding requires many times more tasks

Images are not large enough: must parallelize further

FLOP/IO ratio of recursive filters is too low

Can use even more FLOPs but must reduce IO

To do so, we introduce overlapping

(Table: Alg., Step, Complexity, Max. # of Threads, Memory, Bandwidth; row RT)

Slide16

Causal-anticausal overlapping

Start anticausal processing before causal is done

Saves reading and writing causal results!

Compute and store incomplete prologues & epilogues

Fix incomplete prologues & twice-incomplete epilogues

Twice-incomplete epilogues are trickier

Use them to compute and store anticausal results

Slide17

Fixing twice-incomplete epilogues

Repeatedly apply linearity and superposition

Tedious derivation, simple result
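A first-order illustration of the fix, written as a sequential Python emulation. The assumptions are the convention y[i] = x[i] - a*y[i-1], zero boundary conditions, a length divisible by the block size, and hypothetical names. Instead of closed-form factors, the correction factors are measured by filtering a zero block with a unit prologue or epilogue; linearity and superposition make that valid.

```python
def causal_block(x, a, p):
    y, prev = [], p
    for v in x:
        prev = v - a * prev
        y.append(prev)
    return y


def anticausal_block(y, a, e):
    z, nxt = [0.0] * len(y), e
    for i in range(len(y) - 1, -1, -1):
        nxt = y[i] - a * nxt
        z[i] = nxt
    return z


def overlapped(x, a, b):
    assert len(x) % b == 0  # simplifying assumption
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    m = len(blocks)
    # Stage 1 (per block): incomplete prologue and twice-incomplete epilogue,
    # both computed with zero prologue and zero epilogue.
    pbar, ebar = [], []
    for blk in blocks:
        ybar = causal_block(blk, a, 0.0)
        pbar.append(ybar[-1])
        ebar.append(anticausal_block(ybar, a, 0.0)[0])
    # Effect of a unit prologue/epilogue on a block of zeros.
    zero = [0.0] * b
    unit = causal_block(zero, a, 1.0)
    p_to_p = unit[-1]
    p_to_e = anticausal_block(unit, a, 0.0)[0]
    e_to_e = anticausal_block(zero, a, 1.0)[0]
    # Stage 2: corrected prologues (forward), then corrected epilogues (backward).
    p = [0.0] * m  # p[k]: prologue entering block k
    for k in range(1, m):
        p[k] = pbar[k - 1] + p_to_p * p[k - 1]
    e = [0.0] * m  # e[k]: epilogue entering block k
    for k in range(m - 2, -1, -1):
        e[k] = ebar[k + 1] + p_to_e * p[k + 1] + e_to_e * e[k + 1]
    # Stage 3 (per block): causal and anticausal passes back to back.
    out = []
    for k, blk in enumerate(blocks):
        out.extend(anticausal_block(causal_block(blk, a, p[k]), a, e[k]))
    return out


x = [float(i % 5) for i in range(12)]
result = overlapped(x, -0.5, 4)
reference = anticausal_block(causal_block(x, -0.5, 0.0), -0.5, 0.0)
```

The causal result of a block is consumed immediately by the anticausal pass, so it never has to be written out and read back between stages.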

(Equation labels: twice-incomplete epilogue, corrected prologue, corrected epilogue)

Slide18

Algorithm 4

Adds causal-anticausal overlapping

Eliminates reading and writing causal results

Both in column and in row processing

Modest increase in computation

(Figure: input, stages, output; fix both, fix both)

Slide19

First-order filter benchmarks

Alg. RT is the baseline implementation

Ruijters et al. 2010 “GPU prefilter […]”

Alg. 2 adds block parallelism & tricks

Sung et al. 1986 “Efficient […] recursive […]”

Blelloch 1990 “Prefix sums […]”

+ tricks from GPU parallel scan algorithms

Alg. 4 adds causal-anticausal overlapping

Eliminates 4hw of IO

Modest increase in computation

(Chart: Cubic B-Spline Interpolation, GeForce GTX 480)

(Table: Alg., Step, Complexity, Max. # of Threads, Memory, Bandwidth)

Slide20

Algorithm 5

Adds row-column overlapping

Eliminates reading and writing column results

Modest increase in computation

(Figure: input, stages, output; fix all!)

Slide21

Start from input and global borders

Slide22

Load blocks into shared memory

Slide23

Compute & store incomplete borders

Slide24

Compute & store incomplete borders

Slide25

Compute & store incomplete borders

Slide26

Compute & store incomplete borders

Slide27

Compute & store incomplete borders

Slide28

Compute & store incomplete borders

Slide29

Compute & store incomplete borders

Slide30

Compute & store incomplete borders

Slide31

All borders in global memory

Slide32

Fix incomplete borders

Slide33

Fix twice-incomplete borders

Slide34

Fix thrice-incomplete borders

Slide35

Fix four-times-incomplete borders

Slide36

Done fixing all borders

Slide37

Load blocks into shared memory

Slide38

Finish causal columns

Slide39

Finish anticausal columns

Slide40

Finish causal rows

Slide41

Finish anticausal rows

Slide42

Store results to global memory

Slide43

Done!

Slide44

Row-column overlapping rules

Fixing thrice-incomplete row-prologues

Fixing four-times-incomplete row-epilogues

Slide45

First-order filter benchmarks

Alg. RT is the baseline implementation

Ruijters et al. 2010 “GPU prefilter […]”

Alg. 2 adds block parallelism & tricks

Sung et al. 1986 “Efficient […] recursive […]”

Blelloch 1990 “Prefix sums […]”

+ tricks from GPU parallel scan algorithms

Alg. 4 adds causal-anticausal overlapping

Eliminates 4hw of IO

Modest increase in computation

Alg. 5 adds row-column overlapping

Eliminates additional 2hw of IO

Modest increase in computation

(Chart: Cubic B-Spline Interpolation, GeForce GTX 480)

(Table: Alg., Step, Complexity, Max. # of Threads, Memory, Bandwidth)

Slide46

Second-order filter benchmarks

Alg. 4 uses causal-anticausal overlapping

Alg. 5_2 adds row-column overlapping

Added complexity outweighs IO reduction

Balance will change (hardware, compiler, implementation)

(Chart: Quintic B-Spline Interpolation, GeForce GTX 480)

(Table: Alg., Step, Complexity, Max. # of Threads, Memory, Bandwidth)

Slide47

Gaussian blur results

CUFFT is in frequency domain; complexity […]

DIR is direct convolution; complexity […]

Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”

Overlapped recursive: order […] approximation, complexity […]

van Vliet et al. 1998 “Recursive Gaussian derivative filters”

Implemented as Alg. 5 fused with Alg. 4

Recursive approximation is faster

Even for modest size images

Also modest standard-deviations

(Chart: Gaussian Blur, GeForce GTX 480; labels: DIR, 2.5, DIR, Overlapped Recursive, CUFFT)

Slide48

Summed-area table benchmarks

Harris et al. 2008, GPU Gems 3 “Parallel prefix-scan […]”

Multi-scan + transpose + multi-scan

Implemented with CUDPP

Hensley 2010, Gamefest “High-quality depth of field”

Multi-wave method

Our improvements

+ specialized row and column kernels

+ save only incomplete borders

+ fuse row and column stages

Overlapped SAT

Row-column overlapping
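For reference, a summed-area table is a running sum along columns followed by a running sum along rows. The plain Python sketch below shows that definition and the usual four-lookup box sum. It is a sequential reference with hypothetical function names, not any of the benchmarked GPU methods.

```python
def summed_area_table(image):
    """sat[r][c] = sum of image[0..r][0..c] (inclusive)."""
    h, w = len(image), len(image[0])
    sat = [row[:] for row in image]
    for c in range(w):          # column pass: running sum down each column
        for r in range(1, h):
            sat[r][c] += sat[r - 1][c]
    for r in range(h):          # row pass: running sum along each row
        for c in range(1, w):
            sat[r][c] += sat[r][c - 1]
    return sat


def box_sum(sat, r0, c0, r1, c1):
    """Sum of image[r0..r1][c0..c1] (inclusive) from at most four lookups."""
    total = sat[r1][c1]
    if r0 > 0:
        total -= sat[r0 - 1][c1]
    if c0 > 0:
        total -= sat[r1][c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1][c0 - 1]
    return total


sat = summed_area_table([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(box_sum(sat, 1, 1, 2, 2))  # 28 = 5 + 6 + 8 + 9
```

Each pass is a causal recursion with no anticausal counterpart, so the same blocking and border-fixing ideas apply.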

(Chart: Summed-area Table, GeForce GTX 480; series: Harris et al [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT)

First-order filter, unit coefficient, no anticausal component

Slide49

Future work

Volumetric processing

Overlapping should generalize

Not enough shared memory (yet?)

CPU implementation

Blocking should increase L1 cache effectiveness

Is doubling amount of computation worth it?

Solving general narrow-banded linear systems

Overlapping back- and forward-substitution

Slide50

Conclusions

Recursive filters are useful in many applications

Cubic and quintic B-Spline interpolation

Gaussian-blur approximation

Summed-area table computation

We introduced parallel algorithms for GPUs

Overlapping reduces IO requirements

Leads to faster algorithms

Code is available from project page

Most is already there, rest is on the way

Slide51

Questions?

baseline: Alg. RT (0.5 GiP/s)

+ block parallelism: Alg. 2 (3 GiP/s)

+ causal-anticausal