An Efficient GPGPU Implementation of Viola-Jones Classifier based Face Detection Algorithm

Sharmila Shridhar, Vinay Gangadhar, Ram Sai Manoj
ECE 759 Project Presentation
Fall 2015
University of Wisconsin - Madison
Executive Summary

- Viola Jones classifier based face detection implemented on a GPU.
- Detection done in 2 phases: (1) Nearest Neighbor and Integral Image, (2) Scanning Window and HAAR feature detection.
- Optimizations applied to both phases: shared memory, avoiding bank conflicts, thread-block divergence, etc.
- Up to 5.3x speedup compared to single-threaded CPU performance.
- GPU performs better for larger images.
Introduction

- Face detection is a hot algorithm: auto-tagging pictures in Facebook, easy search in Google Photos, biometric-based security access, threat activity detection, face mapping.
- Human faces share similar properties (HAAR features).
- Different types of algorithms: motion based, color based, and Viola Jones classifier based.

Motivation

- Face detection algorithms have a large amount of data-level parallelism (DLP): each window is processed separately to detect a face.
- GPU resources can therefore be utilized efficiently for a performance- and energy-efficient face detection implementation.
- This application can be used to showcase the benefits of GPGPU implementations, and the incremental benefits of fine-tuning the code (optimizations).
Outline

- Introduction
- Motivation
- Viola Jones Background
- Nearest Neighbor and Integral Image
- HAAR based Cascade Classifier Implementation
- Evaluation and Results
- Conclusion
Background - Viola Jones Algorithm (1)

Haar Feature Selection
- Each feature consists of white and black rectangles.
- Subtract the white region's pixel sum from the black region's pixel sum.
- If (sum > threshold), the region has the Haar feature.
- We use pre-selected Haar features as classifiers for face detection.
Background - Viola Jones Algorithm (2)

Integral Image Calculation
- The sum calculation for each Haar rectangle dominates the work; it is sped up by using the integral image.
- Integral sum at any pixel (x, y): ii(x, y) = sum of i(x', y') over all x' <= x, y' <= y, where i(x', y') is the value of the pixel at (x', y'). (A rectangle-sum sketch follows below.)
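For illustration, here is a minimal sketch of the constant-time rectangle sum the integral image enables. It assumes a row-major integral image with one extra zero row and column (a common convention; an in-place inclusive scan, as used later in this project, differs only by index offsets), and the name rectSum is ours, not the project's code:

```cuda
// Sum of the pixels in the rectangle [x0, x1) x [y0, y1), in O(1), from a
// precomputed integral image ii of pitch `stride` (= image width + 1 under
// the zero-padded convention assumed here).
template <typename T>
__host__ __device__ inline T rectSum(const T* ii, int stride,
                                     int x0, int y0, int x1, int y1)
{
    return ii[y1 * stride + x1]    // everything up to the bottom-right corner
         - ii[y0 * stride + x1]    // minus the strip above the rectangle
         - ii[y1 * stride + x0]    // minus the strip left of the rectangle
         + ii[y0 * stride + x0];   // re-add the doubly subtracted corner block
}
```

A Haar feature value is then just a weighted difference of such rectangle sums, e.g. black-region sum minus white-region sum, compared against the feature threshold.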
Background - Viola Jones Algorithm (3)

Cascade Classifier
- Adaboost algorithm: the outputs of several weak classifiers (Haar features) are combined into a weighted sum to form a strong classifier.
- Strong classifiers are then concatenated to form a cascade classifier.
Background - Image Pyramid

- We use a 25 x 25 window size for each feature, but a face in the image can be of any size.
- Use an image pyramid to make face detection scale invariant.
Outline

- Introduction
- Motivation
- Viola Jones Background
- Nearest Neighbor and Integral Image
- HAAR based Cascade Classifier Implementation
- Evaluation and Results
- Conclusion
Implementation Flow

1. Read the source image and the cascade classifier parameters.
2. Nearest Neighbor: downscale the image by a factor of 1.2 per iteration, as long as (image size / scaling factor) >= 25 (the detection window is 25 x 25).
3. Integral Image: compute the sum of pixels from [0,0] to [x,y], and the sum of squares of pixels from [0,0] to [x,y].
4. Set the image up for HAAR detection: compute the image co-ordinates for each HAAR feature.
5. Run the cascade classifier, shifting the detection window across the image:
   - Integral sum > threshold for all stages: face detected, store the co-ordinates.
   - Integral sum < threshold for a stage: skip this window for all further stages.
6. Group the rectangles and draw rectangles around the faces: image with detected faces.
Nearest Neighbor (NN)

- Computes the image pixels for the downscaled (DS) image.
- Scaling factor of 1.2 used in our implementation; detection window of 25 x 25 pixels.
- The image is downscaled until it is equal to the detection window.
- Why downscale? To make detection scale invariant (image pyramid).
- Example: downscaling an 8 x 8 image with a scaling factor of 2:

    Source image                       Downscaled image
     0  8 16 24 32 40 48 56
     1  9 17 25 33 41 49 57            0 16 32 48
     2 10 18 26 34 42 50 58     NN     2 18 34 50
     3 11 19 27 35 43 51 59   ----->   4 20 36 52
     4 12 20 28 36 44 52 60            6 22 38 54
     5 13 21 29 37 45 53 61
     6 14 22 30 38 46 54 62
     7 15 23 31 39 47 55 63

- Parallelization scope: DS image pixels are calculated independently from the scale factor offset and the width & height of the source image. Map each (or more) pixel position to a single thread. (A kernel sketch follows below.)
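A minimal sketch of that thread mapping, with one thread per downscaled pixel (kernel and buffer names are illustrative, not the project's actual code):

```cuda
// Nearest-neighbor downscale: thread (x, y) of the output image samples the
// source pixel at the scaled coordinates. `scale` is the cumulative factor
// (1.2 per pyramid iteration in our implementation).
__global__ void nearestNeighbor(const unsigned char* src, int srcW, int srcH,
                                unsigned char* dst, int dstW, int dstH,
                                float scale)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    int sx = min((int)(x * scale), srcW - 1);   // nearest source column
    int sy = min((int)(y * scale), srcH - 1);   // nearest source row
    dst[y * dstW + x] = src[sy * srcW + sx];
}
```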
Integral Image (II)

- Integral image: sum of all the pixels above & to the left of (X, Y), for each X & Y in the image.
- Structure: prefix scan along each row for all rows (RowScan, RS), then along each column (ColumnScan, CS):

    Downscaled image         After RS                After CS (integral image)
     0 16 32 48             0 16  48  96              0 16  48  96
     2 18 34 50     RS      2 20  54 104      CS      2 36 102 200
     4 20 36 52   ----->    4 24  60 112    ----->    6 60 162 312
     6 22 38 54             6 28  66 120             12 88 228 422

- Parallelization scope: the prefix scan of each row is independent of the other rows.
- Similarly compute the square integral sum (sum of squares of pixels). Why? We need the variance of the pixels for the Haar rectangle co-ordinates: Var(X) = E(X^2) - (E(X))^2. (A variance sketch follows below.)
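Where the variance is needed per window, the two scans combine as sketched below, reusing the hypothetical rectSum helper from the background section (the project's kernels compute this as part of window normalization):

```cuda
// Variance of a win x win pixel window at (x, y), from the integral image
// (ii) and the square-integral image (sqii): Var(X) = E(X^2) - (E(X))^2.
__device__ float windowVariance(const unsigned int* ii,
                                const unsigned long long* sqii,
                                int stride, int x, int y, int win)
{
    float n    = (float)(win * win);
    float mean = rectSum(ii,   stride, x, y, x + win, y + win) / n;  // E(X)
    float msq  = rectSum(sqii, stride, x, y, x + win, y + win) / n;  // E(X^2)
    return msq - mean * mean;
}
```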
NN + II Implementation

Implementing the NN and II sum in 4 separate kernels:
- Kernel 1: Nearest Neighbor (NN) & RowScan (RS) – downscaled image ready
- Kernel 2: Matrix Transpose
- Kernel 3: RowScan
- Kernel 4: Matrix Transpose

Kernels 2, 3 & 4 together implement the ColumnScan; the integral sum & square integral sum are ready at the end of kernel 4.
Kernel 1: Nearest Neighbor (NN) & RowScan (RS)

    Source image (8 x 8)               DS image              After RS
     0  8 16 24 32 40 48 56
     1  9 17 25 33 41 49 57            0 16 32 48            0 16 48  96
     2 10 18 26 34 42 50 58     NN     2 18 34 50     RS     2 20 54 104
     3 11 19 27 35 43 51 59   ----->   4 20 36 52   ----->   4 24 60 112
     4 12 20 28 36 44 52 60            6 22 38 54            6 28 66 120
     5 13 21 29 37 45 53 61
     6 14 22 30 38 46 54 62
     7 15 23 31 39 47 55 63

- Combining NN & RS eliminates one global memory access (no intermediate store between kernels).
- Kernel configuration (w, h – width & height of the downscaled image):
  - Threads per block = smallestpower2(w) – constraint from the RowScan algorithm
  - Blocks = h
- RowScan – inclusive prefix scan of each row in the image: Harris-Sengupta-Owen algorithm (upsweep & downsweep). (A scan sketch follows below.)
- Shared memory: [2 * BLOCKSIZE + 1] elements.
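Below is a simplified sketch of the RowScan step on its own: one block scans one row with the upsweep/downsweep pattern of the Harris-Sengupta-Owen scan. The real kernel 1 also performs the NN sampling on load and scans the squared pixel values in the same pass; the names and single-array layout here are illustrative.

```cuda
// Inclusive prefix scan of each image row. Launch with one block per row,
// blockDim.x = smallestpower2(width), and blockDim.x * sizeof(int) bytes of
// dynamic shared memory: rowScan<<<height, n, n * sizeof(int)>>>(...).
__global__ void rowScan(const int* in, int* out, int width)
{
    extern __shared__ int temp[];        // one padded row, n = power of two
    int tid = threadIdx.x;
    int row = blockIdx.x;
    int n   = blockDim.x;

    // Load the row, zero-padding past its end to keep the scan tree balanced.
    int x = (tid < width) ? in[row * width + tid] : 0;
    temp[tid] = x;
    __syncthreads();

    // Upsweep: build partial sums in place.
    for (int stride = 1; stride < n; stride <<= 1) {
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < n) temp[idx] += temp[idx - stride];
        __syncthreads();
    }

    // Downsweep: clear the root, then swap-and-add to get an exclusive scan.
    if (tid == 0) temp[n - 1] = 0;
    __syncthreads();
    for (int stride = n / 2; stride > 0; stride >>= 1) {
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < n) {
            int t = temp[idx - stride];
            temp[idx - stride] = temp[idx];
            temp[idx] += t;
        }
        __syncthreads();
    }

    // Exclusive result + own input = inclusive scan.
    if (tid < width) out[row * width + tid] = temp[tid] + x;
}
```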
Kernels 2, 3 & 4: ColumnScan (CS)

- Straightforward implementation: a RowScan followed directly by a ColumnScan:

    Downscaled image         After RS                After CS
     0 16 32 48             0 16  48  96             0 16  48  96
     2 18 34 50     RS      2 20  54 104     CS      2 36 102 200
     4 20 36 52   ----->    4 24  60 112   ----->    6 60 162 312
     6 22 38 54             6 28  66 120            12 88 228 422

- In our implementation the ColumnScan is instead replaced with kernels 2, 3 & 4. Why, when it could have been done as a plain RowScan & ColumnScan? See the motivation on the next slide.
Motivation for Transpose

- ColumnScan – perform a prefix scan along each column.
- Global memory (GM) layout of the 2D matrix (after RowScan) is row-major:

    Row 0: 0 16 48 96 | Row 1: 2 20 54 104 | Row 2: 4 24 60 112 | Row 3: 6 28 66 120

- With threads T0, T1, T2, T3 each scanning one column, the global memory reads aren't coalesced.
- This is time consuming (400 – 500 cycles each access!) & each thread needs a separate access.
- Alternate method for ColumnScan: transpose, RowScan, transpose (next slide).
ColumnScan Breakdown

    DS image after RowScan     Transpose             RowScan            Transpose (integral sum)
     0 16 48  96            0   2   4   6         0   2   6  12          0 16  48  96
     2 20 54 104   --->    16  20  24  28  --->  16  36  60  88  --->    2 36 102 200
     4 24 60 112           48  54  60  66        48 102 162 228          6 60 162 312
     6 28 66 120           96 104 112 120        96 200 312 422         12 88 228 422

- The matrix transpose is implemented in a tiled fashion – coalesced reads & writes to GM.
- Further optimizations continued...
Optimizations in Transpose & RS Kernels

Matrix Transpose:
- Naive way: GM coalescing is achieved for the row-wise writes only; the reads walk down a column of the input (e.g. 48, 96 or 54, 104) and are uncoalesced.
- Optimized version: stage each tile through shared memory [BLOCKSIZE][BLOCKSIZE], so that both the reads from GM and the writes to GM are row-wise and coalesced.
- Any SM bank conflicts? Shared memory has 32 banks, and threads in a warp that access the same bank serialize. For the transposed read, SM[Ty][Tx] is changed to SM[Tx][Ty]; with BLOCKSIZE = 16, (Tx, Ty) maps to bank (Tx * 16 + Ty) % 32, so e.g. (Tx, Ty) of (0, 0) & (2, 0) both map to bank 0.
- Eliminated by padding the tile: shared memory [BLOCKSIZE][BLOCKSIZE + 1]. (A code sketch follows below.)
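A sketch of the optimized tiled transpose with the padding fix, using 16 x 16 tiles to match the slide (names are illustrative; launch with block = (16, 16) and a grid covering the matrix):

```cuda
#define TILE 16

// Tiled transpose: read a tile with coalesced row accesses, then write it
// out transposed, also with coalesced row accesses. The +1 padding column
// makes tile[tx][ty] land in a different bank for each tx, eliminating the
// (Tx * 16 + Ty) % 32 conflicts described above.
__global__ void transposeTiled(const int* in, int* out, int width, int height)
{
    __shared__ int tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    // Swap the block coordinates so writes are row-contiguous in the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```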
Optimization in RowScan Only

- RowScan – row-wise prefix scan.
- Use extern shared memory (SM); don't turn it into an occupancy constraint by hardcoding its size.
- Kernel execution configuration: shared memory size required = 2 * (width of image) [one buffer each for the integral sum & the square integral sum].
- At a downscaled size of 256 x 256 (from a 1K x 1K source): a 1D launch with 256 threads per block (TPB) and 256 blocks gives 6 blocks alive per SM – 100% occupancy (1536 threads).
- Hardcoding to the max case of 1K TPB (8 kB of SM) decreases this to 5 blocks (only 84% occupancy). (A launch sketch follows below.)

Face detection using the Cascade Classifier (CC) follows...
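A host-side sketch of this launch configuration, sizing the extern shared memory dynamically per downscale iteration instead of hardcoding the 1K-pixel worst case. smallestPow2 is a hypothetical helper matching the slide's smallestpower2(w); the project passes 2 * tpb * sizeof(int) to scan both sums in one kernel, while the simplified rowScan sketched earlier needs one row's worth:

```cuda
// Smallest power of two >= n (hypothetical helper).
static int smallestPow2(int n) { int p = 1; while (p < n) p <<= 1; return p; }

void launchRowScan(const int* d_in, int* d_out, int width, int height)
{
    int tpb = smallestPow2(width);      // RowScan needs a power-of-two block
    size_t smem = tpb * sizeof(int);    // 2x in the real kernel (sum + sqsum)
    rowScan<<<height, tpb, smem>>>(d_in, d_out, width);
}
```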
Outline

- Introduction
- Motivation
- Viola Jones Background
- Nearest Neighbor and Integral Image
- HAAR based Cascade Classifier Implementation
- Evaluation and Results
- Conclusion
Implementation: Scan Window Processing

- Image size: 1024 x 1024 (can vary); classifier size: 25 x 25 (fixed).
- Some specs of the classifiers:
  - Each HAAR classifier has up to 3 rectangles.
  - Each stage consists of up to 211 HAAR classifiers.
  - Our algorithm has 25 stages with 2913 HAAR classifiers in total.
Implementation: Scan Window Processing (continued)

- Image size: 1024 x 1024; classifier size: 25 x 25.
- For each image, we consider a 25 x 25 moving scan window; the next scan window is obtained by shifting one pixel.
- Each scan window is processed independently.
- For a 1024 x 1024 image, there are 1000 x 1000 = 10^6 scan windows.
Baseline Implementation: Scan Window Processing

- Each thread operates on one scan window.
- Each scan window is processed through all 25 stages to detect a face.
- A bit vector (BV) keeps track of rejected scan windows: e.g. scan window (20, 30) holds BV[20][30] = True while it keeps passing stages 1, 2, ..., 25, and flips to False once a stage rejects it.
- The bit vector is copied back to host memory. (A kernel sketch follows below.)
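A heavily simplified sketch of this baseline, one thread per window through the full cascade. The HaarRect/HaarClassifier layout and stage tables are illustrative stand-ins for the project's actual classifier data, rectSum is the helper sketched in the background section, and per-window variance normalization is omitted:

```cuda
struct HaarRect       { int x0, y0, x1, y1; float weight; };
struct HaarClassifier { HaarRect rect[3]; float threshold, passVal, failVal; };

// One thread = one 25x25 window, evaluated through all stages.
__global__ void scanWindows(const unsigned int* ii, int stride,
                            const HaarClassifier* cls, const int* stageLen,
                            const float* stageThresh, int numStages,
                            unsigned char* bv, int winRows, int winCols)
{
    int wx = blockIdx.x * blockDim.x + threadIdx.x;
    int wy = blockIdx.y * blockDim.y + threadIdx.y;
    if (wx >= winCols || wy >= winRows) return;

    bool face = true;
    int c = 0;                                 // running classifier index
    for (int s = 0; s < numStages; ++s) {
        float stageSum = 0.0f;
        for (int k = 0; k < stageLen[s]; ++k, ++c) {
            const HaarClassifier& h = cls[c];
            float v = 0.0f;                    // weighted rectangle sums
            for (int r = 0; r < 3; ++r)
                v += h.rect[r].weight *
                     rectSum(ii, stride, wx + h.rect[r].x0, wy + h.rect[r].y0,
                                         wx + h.rect[r].x1, wy + h.rect[r].y1);
            stageSum += (v > h.threshold) ? h.passVal : h.failVal;
        }
        if (stageSum < stageThresh[s]) { face = false; break; }  // reject early
    }
    bv[wy * winCols + wx] = face;   // bit vector copied back to the host
}
```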
Optimizations (1): Scan Window Processing

Use shared memory:
- The classifier information is common to all threads: the indices of the Haar classifiers, the weights of each rectangle in a Haar classifier, the threshold of each Haar classifier, and the threshold of each stage.
- Bring these data into shared memory and share them across the thread block.
- Due to the shared memory limitation, the entire scan window processing is split into 12 kernels (2913 classifiers in total across the 12 kernels).
- Each kernel uses approximately 19 kB of shared memory. (A staging sketch follows below.)
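A sketch of how one of the 12 kernels might stage its slice of the classifier table into shared memory before any window is evaluated. CLS_PER_KERNEL and the HaarClassifier layout from the baseline sketch are illustrative (at 72 bytes per classifier, ~243 classifiers come to about 17.5 kB, in the same ballpark as the ~19 kB figure):

```cuda
#define CLS_PER_KERNEL 243   // illustrative: ~2913 classifiers / 12 kernels

__global__ void haarStageKernel(const HaarClassifier* d_cls,
                                int firstCls /* ...window args elided... */)
{
    // This kernel's slice of the classifier table, shared by the block.
    __shared__ HaarClassifier sCls[CLS_PER_KERNEL];

    // Cooperative copy: every thread loads a strided subset of the table.
    for (int i = threadIdx.y * blockDim.x + threadIdx.x;
         i < CLS_PER_KERNEL; i += blockDim.x * blockDim.y)
        sCls[i] = d_cls[firstCls + i];
    __syncthreads();   // table ready before any window is evaluated

    // ... evaluate this kernel's stages reading sCls[] instead of d_cls[] ...
}
```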
Optimizations (2): Scan Window Processing

- Use pinned host memory: replace malloc with cudaMallocHost. (A sketch follows below.)
- Use fast math: variance = sqrtf(square sum) – the special function unit in the GPU is used.
- Do not use maxrregcount: our kernel needs around 26 registers, and if we restrict it to 20, spilling occurs. Occupancy is not always a measure of performance.
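A sketch of the pinned-memory swap for the bit vector that is copied back to the host (sizes and names are illustrative):

```cuda
// Page-locked host buffer for the 1000 x 1000 window bit vector.
size_t bvBytes = 1000 * 1000 * sizeof(unsigned char);

unsigned char *h_bv, *d_bv;
cudaMallocHost(&h_bv, bvBytes);   // pinned: instead of malloc(bvBytes)
cudaMalloc(&d_bv, bvBytes);

// ... launch the 12 detection kernels, then copy the survivors back:
cudaMemcpy(h_bv, d_bv, bvBytes, cudaMemcpyDeviceToHost);

cudaFree(d_bv);
cudaFreeHost(h_bv);               // pinned allocations use cudaFreeHost
```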
Optimizations (3): Scan Window Processing

Use block divergence:
- If a scan window fails at the end of any kernel, reject it.
- This results in thread divergence, but more importantly, it leads to block divergence: thread blocks in which all threads are rejected won't be launched at all.
- This block divergence is the common case for most of the image windows.
- Hence, according to Amdahl's law, this optimization gave a huge performance benefit.
Outline

- Introduction
- Motivation
- Viola Jones Background
- Nearest Neighbor and Integral Image
- HAAR based Cascade Classifier Implementation
- Evaluation and Results
- Conclusion
Evaluation

- Used a GTX 480 GPU for evaluation: 15 SMs, 1.5 GB global memory.
- Shared memory usage: 8.2 kB for the NN and II kernels; 19 kB for the HAAR kernels.
- Occupancy: 100% for NN and II; 66.67% for the HAAR kernels.
- 1024 x 1024 image size used for evaluation, with 21 downsampling iterations.
- Compared with single-threaded CPU performance.
Performance of NN and II Kernels (1)-(4)

[Per-kernel performance charts for the NN and II kernels]

NN + II Overall Performance

Overall NN and II SpeedUp = 1.46x
Performance of HAAR Kernels

[Per-kernel performance chart]

Performance of HAAR Kernels (2)

Per-kernel speedups over the CPU: 29.7x, 82.7x, 128.8x, 155.9x, 161.1x, 135.1x, 137.8x, 139.2x, 144x, 221.3x, 212.7x
Speed Up Over Iterations

[Chart: speedup across the downsampling iterations]
Scanning Window Speed Up Comparison

[Chart] Overall Scanning Window SpeedUp = 5.47x
Overall Face Detection Speed Up

[Chart] Overall Face Detection SpeedUp = 5.35x
GPU Speed Up Over Varying Image Sizes

[Chart: GPU speedup for varying image sizes]
GPU Face Detection Accuracy

    Faces in image        1     2     4     8      16    32
    Detection Rate (%)    100   100   100   87.5   100   93.75

    Average Detection Rate: 96.875%
Lessons Learned & Future Work

- GPU provides performance and energy benefits over CPU for parallelizable workloads.
- But this comes at a cost: the bottlenecks need to be understood.
- Finer-level optimizations can reap further benefits.

Future work:
- Compare GPU performance with equivalent OpenMP and MPI code.
- The OpenCV library provides CUDA APIs for object detection; compare the performance of our implementation with it.
- Detection accuracy can be improved with a more robust version.
Conclusion

- Face detection is a good candidate for parallelization.
- Optimizations help in increasing GPU performance.
- Up to 5.3x performance improvement on GPU.
- Further improvements are possible with careful analysis and hardcoding.