Slide1
GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab (1), A. Maximo (1), R. S. Lima (2), H. Hoppe (3)
(1) IMPA, (2) Digitok, (3) Microsoft Research
Slide2
Recursive filters
Linear, shift-invariant filters
But use feedback from earlier outputs
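The feedback from earlier outputs can be illustrated with a first-order causal filter. This is a minimal sketch, assuming the recurrence y[i] = x[i] - a1 * y[i-1]; the sign convention, the name `causal_first_order`, and the parameter names are assumptions made for illustration and do not come from the slides. The prologue is the output value that precedes the first input sample.

```python
def causal_first_order(x, a1, prologue=0.0):
    """Assumed recurrence: y[i] = x[i] - a1 * y[i-1], with y[-1] = prologue."""
    y = []
    prev = prologue
    for xi in x:
        # Each output depends on the previous output (the feedback).
        prev = xi - a1 * prev
        y.append(prev)
    return y


if __name__ == "__main__":
    # Impulse response with a1 = -0.5: each output is half of the previous one.
    print(causal_first_order([1.0, 0.0, 0.0, 0.0], a1=-0.5))
```

Because every output needs the one before it, the loop cannot be run out of order as written.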
[Figure: input, prologue, output]
Slide3
Recursive filters
Linear, shift-invariant filters
But use feedback from earlier outputs
Sequential dependency chain
[Figure: input, prologue, output]
Slide4
Applications of recursive filtering
B-Spline (or other) interpolation
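The recursive preprocessing step for cubic B-spline interpolation can be sketched as one causal and one anticausal first-order pass. This is a minimal sketch, assuming the standard cubic B-spline pole at sqrt(3) - 2 and zero prologue and epilogue instead of a proper boundary treatment; the function names are hypothetical.

```python
import math


def bspline3_coefficients(x):
    """Cubic B-spline coefficients via a causal and an anticausal pass.

    Assumptions: feedback a1 = 2 - sqrt(3), per-pass gain b0 = 1 + a1,
    zero prologue and zero epilogue (so values near the ends are inexact).
    """
    a1 = 2.0 - math.sqrt(3.0)
    b0 = 1.0 + a1
    # Causal pass: y[i] = b0 * x[i] - a1 * y[i-1]
    y, prev = [], 0.0
    for xi in x:
        prev = b0 * xi - a1 * prev
        y.append(prev)
    # Anticausal pass: c[i] = b0 * y[i] - a1 * c[i+1]
    c, nxt = [0.0] * len(y), 0.0
    for i in reversed(range(len(y))):
        nxt = b0 * y[i] - a1 * nxt
        c[i] = nxt
    return c


def interpolate_at_sample(c, i):
    """Evaluate the cubic B-spline at integer position i (weights 1/6, 4/6, 1/6)."""
    return (c[i - 1] + 4.0 * c[i] + c[i + 1]) / 6.0
```

Away from the ends of the signal, evaluating the spline at a sample position should reproduce the input value, which is what makes the result an interpolation.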
[Figure: input, recursive preprocessing step, coefficients, interpolation (from coefficients)]
Slide5
Applications of recursive filtering
B-Spline (or other) interpolation
Fast, wide, Gaussian-blur approximation
Summed-area tables
[Figure: input, recursive filters, blurred]
Slide6
Causality and order
Recursive filters can be causal or anticausal
Causal goes forward, anticausal in reverse direction
Filter order is simply the number r of feedbacks
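Causality and order can be made concrete with a filter of order r, where r is the number of feedback coefficients. This is a minimal sketch, assuming the recurrence y[i] = x[i] - sum over k of a[k-1] * y[i-k]; the function names and the layout of the prologue and epilogue lists are assumptions for illustration.

```python
def causal(x, a, prologue=None):
    """Order-r causal filter, r = len(a).

    prologue = [y[-r], ..., y[-1]] (defaults to zeros).
    """
    r = len(a)
    hist = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        # hist[-1] is y[i-1], hist[-2] is y[i-2], and so on.
        yi = xi - sum(a[k] * hist[-1 - k] for k in range(r))
        y.append(yi)
        hist = hist[1:] + [yi]
    return y


def anticausal(y, a, epilogue=None):
    """Same recurrence run in reverse: z[i] depends on z[i+1], ..., z[i+r].

    epilogue = [z[n], ..., z[n+r-1]] (defaults to zeros).
    """
    rev = None if epilogue is None else list(reversed(epilogue))
    return causal(y[::-1], a, rev)[::-1]
```

The anticausal pass is written as the causal pass applied to the reversed sequence, which mirrors the "forward" versus "reverse direction" description.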
[Figure: input, epilogue, output]
Slide7
Filter sequences and separability
Often, sequences of recursive filters are needed
Independent columns: causal, anticausal
Independent rows: causal, anticausal
Slide8
Algorithm RT
The baseline algorithm
Process columns in parallel, then rows in parallel
Ruijters et al. 2010 “GPU prefilter […]”
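The column-then-row structure can be sketched on the CPU as follows. This is a sequential emulation for illustration, not the cited GPU implementation: each column (then each row) is an independent task that a GPU would assign to its own thread. The first-order recurrence, the zero prologue and epilogue, and the names `filter_1d` and `columns_then_rows` are assumptions.

```python
def filter_1d(x, a1):
    """Causal then anticausal first-order pass, zero prologue and epilogue."""
    y, prev = [], 0.0
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    z, nxt = [0.0] * len(y), 0.0
    for i in reversed(range(len(y))):
        nxt = y[i] - a1 * nxt
        z[i] = nxt
    return z


def columns_then_rows(image, a1):
    """image is a list of h rows, each a list of w floats."""
    h, w = len(image), len(image[0])
    # Stage 1: every column is independent (one parallel task per column).
    cols = [filter_1d([image[i][j] for i in range(h)], a1) for j in range(w)]
    tmp = [[cols[j][i] for j in range(w)] for i in range(h)]
    # Stage 2: every row is independent (one parallel task per row).
    return [filter_1d(row, a1) for row in tmp]
```

With this structure the available parallelism is limited to the number of columns or rows, and the intermediate image is written out and read back between stages.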
[Figure: input, stages (column processing, row processing), output]
Slide9
First-order filter benchmarks
Alg. RT is the baseline implementation
Ruijters et al. 2010 “GPU prefilter […]”
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 7; series: RT]
[Table: Alg. | Step Complexity | Max. # of Threads | Used Bandwidth; row: RT]
Slide10
Optimization roadmap
Modern GPUs have several hundred cores
Latency-hiding requires many times more tasks
Images are not large enough: must parallelize further
[Table: Alg. | Step Complexity | Max. # of Threads | Used Bandwidth; row: RT]
Slide11
Increasing parallelism
Similar to parallel prefix-sum algorithms
Sengupta et al. 2007 “Scan primitives for GPU computing”
Dotsenko et al. 2008 “Fast scan algorithms […]”
Compute and store incomplete prologues
Fix incomplete prologues
Somewhat more complicated than a recursive invocation
Use prologues to compute and store causal results
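The three steps (compute incomplete prologues, fix them, then use them) can be sketched for a first-order causal filter. This is a minimal sequential emulation, assuming the recurrence y[i] = x[i] - a1 * y[i-1]; the function names are hypothetical. The fix relies on linearity: a block's last output with prologue p equals its last output with zero prologue plus (-a1)^B * p, where B is the block length.

```python
def causal_block(x, a1, prologue):
    """y[i] = x[i] - a1 * y[i-1], with y[-1] = prologue."""
    y, prev = [], prologue
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    return y


def causal_blocked(x, a1, block):
    blocks = [x[i:i + block] for i in range(0, len(x), block)]
    # Step 1 (parallel over blocks): filter each block as if nothing came
    # before it, and keep only its last output: the incomplete prologue.
    incomplete = [causal_block(b, a1, 0.0)[-1] for b in blocks]
    # Step 2 (short sequential pass over blocks): fix the prologues.
    prologues, p = [], 0.0
    for b, inc in zip(blocks, incomplete):
        prologues.append(p)              # true prologue entering this block
        p = inc + (-a1) ** len(b) * p    # true last output of this block
    # Step 3 (parallel over blocks): filter again with the fixed prologues.
    out = []
    for b, p in zip(blocks, prologues):
        out.extend(causal_block(b, a1, p))
    return out
```

Steps 1 and 3 have one independent task per block, so the sequential dependency is confined to the much shorter pass over block borders.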
Slide12
Fixing incomplete prologues
Superposition
Linearity
Slide13
Algorithm 2
Adds block parallelism
Sung et al. 1986 “Efficient […] recursive […]”, or
Blelloch 1990 “Prefix sums […]”
+ tricks from GPU parallel scan algorithms
[Figure: input, stages (fix, fix, fix, fix), output]
Slide14
First-order filter benchmarks
Alg. RT is the baseline implementation
Ruijters et al. 2010 “GPU prefilter […]”
Alg. 2 adds block parallelism & tricks
Sung et al. 1986 “Efficient […] recursive […]”
Blelloch 1990 “Prefix sums […]”
+ tricks from GPU parallel scan algorithms
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 7; series: RT, 2]
[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth; rows: 2, RT]
Slide15
Optimization roadmap
Modern GPUs have several hundred cores
Latency-hiding requires many times more tasks
Images are not large enough: must parallelize further
FLOP/IO ratio of recursive filters is too low
Can use even more FLOPs but must reduce IO
To do so, we introduce overlapping
[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth; rows: 2, RT]
Slide16
Causal-anticausal overlapping
Start anticausal processing before causal is done
Saves reading and writing causal results!
Compute and store incomplete prologues & epilogues
Fix incomplete prologues & twice-incomplete epilogues
Twice-incomplete epilogues are trickier
Use them to compute and store anticausal results
Slide17
Fixing twice-incomplete epilogues
Repeatedly apply linearity and superposition
Tedious derivation, simple result
[Figure labels: twice-incomplete epilogue, corrected prologue, corrected epilogue]
Slide18
Algorithm 4
Adds causal-anticausal overlapping
Eliminates reading and writing causal results
Both in column and in row processing
Modest increase in computation
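Causal-anticausal overlapping can be sketched in one dimension for a first-order filter. This is a sequential emulation under stated assumptions, not the authors' kernel: the recurrence y[i] = x[i] - a1 * y[i-1] (and its mirror image for the anticausal pass), equal-sized blocks, and all function names are hypothetical. The correction constants are obtained by filtering an all-zero block with a unit prologue, which avoids hand-deriving closed forms.

```python
def causal_pass(x, a1, p):
    y, prev = [], p
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    return y


def anticausal_pass(y, a1, e):
    z, nxt = [0.0] * len(y), e
    for i in reversed(range(len(y))):
        nxt = y[i] - a1 * nxt
        z[i] = nxt
    return z


def overlapped_filter(x, a1, B):
    if len(x) % B != 0:
        raise ValueError("sketch assumes len(x) is a multiple of B")
    blocks = [x[i:i + B] for i in range(0, len(x), B)]
    n = len(blocks)
    zeros = [0.0] * B
    # Effect of a unit prologue on a block's last causal output; by symmetry,
    # also the effect of a unit epilogue on its first anticausal output.
    c = (-a1) ** B
    # Effect of a unit prologue on a block's first anticausal output.
    k = anticausal_pass(causal_pass(zeros, a1, 1.0), a1, 0.0)[0]
    # Stage 1 (parallel): one read per block, store only the two borders.
    inc_p, inc_e = [], []
    for b in blocks:
        y = causal_pass(b, a1, 0.0)
        inc_p.append(y[-1])                            # incomplete prologue
        inc_e.append(anticausal_pass(y, a1, 0.0)[0])   # twice-incomplete epilogue
    # Stage 2: fix prologues forward, then epilogues backward.
    P = [0.0] * n   # P[b]: prologue entering block b
    for b in range(1, n):
        P[b] = inc_p[b - 1] + c * P[b - 1]
    E = [0.0] * n   # E[b]: epilogue entering block b
    for b in range(n - 2, -1, -1):
        E[b] = inc_e[b + 1] + k * P[b + 1] + c * E[b + 1]
    # Stage 3 (parallel): both passes run inside the block, so the causal
    # result is never stored.
    out = []
    for b in range(n):
        out.extend(anticausal_pass(causal_pass(blocks[b], a1, P[b]), a1, E[b]))
    return out
```

The causal result of each block lives only inside stages 1 and 3, which is the read and write the overlapping is meant to save.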
[Figure: input, stages (fix both, fix both), output]
Slide19
First-order filter benchmarks
Alg. RT is the baseline implementation
Ruijters et al. 2010 “GPU prefilter […]”
Alg. 2 adds block parallelism & tricks
Sung et al. 1986 “Efficient […] recursive […]”
Blelloch 1990 “Prefix sums […]”
+ tricks from GPU parallel scan algorithms
Alg. 4 adds causal-anticausal overlapping
Eliminates 4hw of IO
Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 7; series: RT, 2, 4]
[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth; rows: 4, 2, RT]
Slide20
Algorithm 5
Adds row-column overlapping
Eliminates reading and writing column results
Modest increase in computation
[Figure: input, stages (fix all!), output]
Slide21
Start from input and global borders
Slide22
Load blocks into shared memory
Slide23
Compute & store incomplete borders
Slide24
Compute & store incomplete borders
Slide25
Compute & store incomplete borders
Slide26
Compute & store incomplete borders
Slide27
Compute & store incomplete borders
Slide28
Compute & store incomplete borders
Slide29
Compute & store incomplete borders
Slide30
Compute & store incomplete borders
Slide31
All borders in global memory
Slide32
Fix incomplete borders
Slide33
Fix twice-incomplete borders
Slide34
Fix thrice-incomplete borders
Slide35
Fix four-times-incomplete borders
Slide36
Done fixing all borders
Slide37
Load blocks into shared memory
Slide38
Finish causal columns
Slide39
Finish anticausal columns
Slide40
Finish causal rows
Slide41
Finish anticausal rows
Slide42
Store results to global memory
Slide43
Done!
Slide44
Row-column overlapping rules
Fixing thrice-incomplete row-prologues
Fixing four-times-incomplete row-epilogues
Slide45
First-order filter benchmarks
Alg. RT is the baseline implementation
Ruijters et al. 2010 “GPU prefilter […]”
Alg. 2 adds block parallelism & tricks
Sung et al. 1986 “Efficient […] recursive […]”
Blelloch 1990 “Prefix sums […]”
+ tricks from GPU parallel scan algorithms
Alg. 4 adds causal-anticausal overlapping
Eliminates 4hw of IO
Modest increase in computation
Alg. 5 adds row-column overlapping
Eliminates additional 2hw of IO
Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 7; series: RT, 2, 4, 5]
[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth; rows: 5, 4, 2, RT]
Slide46
Second-order filter benchmarks
Alg. 4₂ uses causal-anticausal overlapping
Alg. 5₂ adds row-column overlapping
Added complexity outweighs IO reduction
Balance will change (hardware, compiler, implementation)
[Plot: Quintic B-Spline Interpolation (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 5; series: 5₂, 4₂]
[Table: Alg. | Step Complexity | Max. # of Threads | Memory Bandwidth; rows: 4₂, 5₂]
Slide47
Gaussian blur results
CUFFT is in frequency domain
complexity
DIR is direct convolution
complexity
Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
Overlapped recursive
3rd order approximation
complexity
van Vliet et al. 1998 “Recursive Gaussian derivative filters”
Implemented as 5₁ fused with 4₂
Recursive approximation is faster
Even for modest-size images
Also for modest standard deviations
[Plot: Gaussian Blur (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 4; series: DIR 2.5, DIR 5, DIR 10, Overlapped Recursive, CUFFT]
Slide48
Summed-area table benchmarks
Harris et al. 2008, GPU Gems 3, “Parallel prefix-scan […]”
Multi-scan + transpose + multi-scan
Implemented with CUDPP
Hensley 2010, Gamefest, “High-quality depth of field”
Multi-wave method
Our improvements
+ specialized row and column kernels
+ save only incomplete borders
+ fuse row and column stages
Overlapped SAT
Row-column overlapping
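A summed-area table can be computed as a causal first-order filter with unit feedback, applied down the columns and then along the rows. This is a minimal CPU sketch of that definition, not any of the benchmarked implementations; the function names `summed_area_table` and `box_sum` are assumptions for illustration.

```python
def summed_area_table(image):
    """sat[i][j] = sum of image[0..i][0..j], inclusive."""
    h, w = len(image), len(image[0])
    sat = [row[:] for row in image]
    # Causal pass down each column: y[i] = x[i] + y[i-1]
    for j in range(w):
        for i in range(1, h):
            sat[i][j] += sat[i - 1][j]
    # Causal pass along each row: y[j] = x[j] + y[j-1]
    for i in range(h):
        for j in range(1, w):
            sat[i][j] += sat[i][j - 1]
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of image[top..bottom][left..right] using four table lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total
```

Once the table exists, the sum over any axis-aligned rectangle costs the same four lookups regardless of its size.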
[Plot: Summed-area Table (GeForce GTX 480); x-axis: Input size (pixels), 64² to 4096²; y-axis: Throughput (GiP/s), 1 to 9; series: Harris et al. [2008], Hensley [2010], Improved Hensley [2010], Overlapped SAT]
First-order filter, unit coefficient, no anticausal component
Slide49
Future work
Volumetric processing
Overlapping should generalize
Not enough shared memory (yet?)
CPU implementation
Blocking should increase L1 cache effectiveness
Is doubling amount of computation worth it?
Solving general narrow-banded linear systems
Overlapping back- and forward-substitution
Slide50
Conclusions
Recursive filters are useful in many applications
Cubic and quintic B-Spline interpolation
Gaussian-blur approximation
Summed-area table computation
We introduced parallel algorithms for GPUs
Overlapping reduces IO requirements
Leads to faster algorithms
Code is available from project page
Most is already there, rest is on the way
Slide51
Questions?
baseline: Alg. RT (0.5 GiP/s)
+ block parallelism: Alg. 2 (3 GiP/s)
+ causal-anticausal overlapping: Alg. 4 (5 GiP/s)
+ row-column overlapping: Alg. 5 (6 GiP/s)
Slide52
Filter sequences and separability
Often, sequences of recursive filters are needed
Independent columns: causal, anticausal
Independent rows: causal, anticausal
Slide53
Further applications
Exponential smoothing of time series
General infinite impulse-response filters (IIR)
Inverse of finite-support convolution
Back-substitution on banded Toeplitz matrices
Slide54
Fixing twice-incomplete epilogues
Repeatedly apply linearity and superposition
Tedious derivation, simple result
[Figure labels: twice-incomplete epilogue, fixed prologue, fixed epilogue]