
Slide 1

An Efficient GPGPU Implementation of Viola-Jones Classifier based Face Detection Algorithm

Sharmila Shridhar, Vinay Gangadhar, Ram Sai Manoj
ECE 759 Project Presentation
Fall 2015
University of Wisconsin - Madison

Slide 2

Executive Summary
Viola-Jones Classifier based Face Detection implemented on GPU
Detection done in 2 phases: Nearest Neighbor and Integral Image; Scanning Window and HAAR Feature Detection
Optimizations applied to both phases: shared memory, no bank conflicts, thread-block divergence, etc.
Up to 5.3x speedup compared to single-threaded CPU performance
GPU performs better for larger images

Slide 3

Introduction
Face detection is a hot algorithm:
Auto-tagging pictures in Facebook
Easy search in Google Photos
Biometric-based security access
Threat activity detection
Face mapping
Human faces have similar properties (HAAR)
Different types of algorithms:
Motion based
Color based
Viola-Jones Classifier based algorithm

Slide 4

Motivation
Face detection algorithms have a large amount of Data Level Parallelism (DLP)
Involve processing each window separately to detect a face
GPU resources can be utilized efficiently
Performance- and energy-efficient face detection implementation
This application can be used to:
Showcase the benefits of GPGPU implementations
Showcase incremental benefits with fine tuning of the code (optimizations)

Slide 5

Outline
Introduction
Motivation
Viola Jones Background
Nearest Neighbor and Integral Image
HAAR based Cascade Classifier Implementation
Evaluation and Results
Conclusion

Slide 6

Background - Viola Jones Algorithm (1)
Haar Feature Selection
Each feature consists of white and black rectangles
Subtract the white region's pixel sum from the black region's pixel sum
If (sum > threshold), the region has the Haar feature
We use pre-selected Haar features as classifiers for face detection
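As a concrete CPU-side illustration of the feature test above (the rectangle coordinates and threshold here are made-up example values, not ones from the trained classifier):

```python
import numpy as np

def haar_feature_value(img, white, black):
    """Evaluate one Haar-like feature: black-region pixel sum minus
    white-region pixel sum. `white` and `black` are (x, y, w, h) rectangles.
    Sums are taken directly here for clarity; the integral image described
    on the next slides makes each rectangle sum O(1)."""
    def rect_sum(r):
        x, y, w, h = r
        return img[y:y+h, x:x+w].sum()
    return rect_sum(black) - rect_sum(white)

def has_feature(img, white, black, threshold):
    """The slide's test: region has the Haar feature if sum > threshold."""
    return haar_feature_value(img, white, black) > threshold
```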

Slide 7

Background - Viola Jones Algorithm (2)
Integral Image Calculation
Sum calculation for each Haar rectangle is very important
Sped up by using the Integral Image
Integral sum at any pixel (x, y):

    II(x, y) = sum over all x' <= x, y' <= y of i(x', y'),

where i(x', y') is the value of the pixel at (x', y')
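A NumPy sketch of the integral image and the four-lookup rectangle sum it enables (the GPU builds the integral image with the scan kernels described later; this is just the CPU-side idea):

```python
import numpy as np

def integral_image(img):
    """II[y, x] = sum of all pixels above and to the left of (x, y),
    inclusive. Built as a prefix scan along rows, then along columns."""
    return img.cumsum(axis=1).cumsum(axis=0)

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y), using four
    lookups into the integral image (with implicit zero padding at edges)."""
    total  = ii[y + h - 1, x + w - 1]
    left   = ii[y + h - 1, x - 1] if x > 0 else 0
    above  = ii[y - 1, x + w - 1] if y > 0 else 0
    corner = ii[y - 1, x - 1] if (x > 0 and y > 0) else 0
    return total - left - above + corner
```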

 

Slide 8

Background - Viola Jones Algorithm (3)
Cascade Classifier
Adaboost Algorithm: the outputs of several weak classifiers (Haar features) are combined into a weighted sum to form a strong classifier
Strong classifiers are further concatenated into a Cascade Classifier

Slide 9

Background - Image Pyramid
We use a 25x25 window size for each feature
But a face in the image can be of any size
Use an image pyramid to make face detection scale invariant
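A small sketch of the pyramid construction, assuming (as the implementation-flow slide states) that downscaling by a factor of 1.2 stops once the image no longer covers the 25x25 detection window:

```python
def pyramid_scales(width, height, window=25, factor=1.2):
    """Scales at which the image is evaluated: keep downscaling by `factor`
    while the downscaled image still fits the detection window."""
    scales = []
    scale = 1.0
    while min(width, height) / scale >= window:
        scales.append(scale)
        scale *= factor
    return scales
```

For a 1024x1024 source this yields 21 scales, matching the 21 downsampling iterations reported in the evaluation section.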

Slide 10

Outline
Introduction
Motivation
Viola Jones Background
Nearest Neighbor and Integral Image
HAAR based Cascade Classifier Implementation
Evaluation and Results
Conclusion

Slide 11

Implementation Flow
Read the source image and the cascade classifier parameters
Nearest Neighbor: image downscaling with a factor of 1.2, repeated while image / scaling factor >= 25 (detection window is 25 x 25)
Integral Image: compute the sum of pixels from [0,0] to [x,y] and the sum of squares of pixels from [0,0] to [x,y]
Set image for HAAR detection: compute the image co-ordinates for each HAAR feature
Run the Cascade Classifier, shifting the detection window across the image:
If integral sum > threshold for all stages: face detected, store the co-ordinates
If integral sum < threshold for a stage: skip this window for further stages
Group rectangles and draw rectangles around the faces
Result: image with detected faces

Slide 12

Nearest Neighbor (NN)
Computes the image pixels for the downscaled (DS) image
Used a scaling factor of 1.2 in the implementation
Detection window of 25 x 25 pixels
Image is downscaled until it's equal to the detection window
Why downscale? Example: an 8x8 source image (pixels 0-63) downscaled with a scaling factor of 2 keeps every other pixel:
[Figure: 8x8 source grid and the resulting 4x4 downscaled grid (0 16 32 48 / 2 18 34 50 / 4 20 36 52 / 6 22 38 54)]
Parallelization scope:
DS image pixels are calculated from the scale factor offset and the width & height of the source image
Map each (or more) pixel position to a single thread
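A minimal NumPy sketch of the nearest-neighbor step; one array slice stands in for the per-thread index computation (on the GPU, each thread computes one or more destination pixels independently):

```python
import numpy as np

def nearest_neighbor_downscale(src, scale):
    """Nearest-neighbor downscaling: destination pixel (x, y) reads the
    source pixel at (int(x * scale), int(y * scale))."""
    h, w = src.shape
    dh, dw = int(h / scale), int(w / scale)
    ys = (np.arange(dh) * scale).astype(int)   # source rows to sample
    xs = (np.arange(dw) * scale).astype(int)   # source columns to sample
    return src[np.ix_(ys, xs)]
```

With `scale=2` on an 8x8 image of pixels 0-63 this reproduces the 4x4 example grid from the slide.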

Slide 13

Integral Image (II)
Sum of all the pixels above & to the left of (X, Y), for each X & Y in the image
Prefix scan along the rows (for all rows) and then along the columns
Parallelization scope:
Prefix scan of each row is independent of the other rows: RowScan (RS)
Prefix scan along the columns: ColumnScan (CS)
Similarly compute the square integral sum (sum of squares)
Why? We need the variance of the pixels for the Haar rectangle co-ordinates:
Var(X) = E(X^2) - (E(X))^2
[Figure: 4x4 downscaled image -> RS -> (0 16 48 96 / 2 20 54 104 / 4 24 60 112 / 6 28 66 120) -> CS -> integral image (0 16 48 96 / 2 36 102 200 / 6 60 162 312 / 12 88 228 422)]

Slide 14

NN + II Implementation
Implementing the NN and II sum in 4 separate kernels:
Kernel 1: Nearest Neighbor (NN) & RowScan (RS) - downscaled image ready
Kernel 2: Matrix Transpose
Kernel 3: RowScan (this acts as the ColumnScan)
Kernel 4: Matrix Transpose
Integral sum & square integral sum ready at the end of kernel 4
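The four-kernel pipeline can be modeled in a few lines on the CPU (a sketch in which `cumsum` stands in for the RowScan kernel and `.T` for the transpose kernels): RowScan, transpose, RowScan, transpose back is equivalent to a direct 2D integral image.

```python
import numpy as np

def row_scan(img):
    """Kernel 1/3 analogue: inclusive prefix scan of each row."""
    return img.cumsum(axis=1)

def integral_via_transpose(img):
    """Kernels 1-4: RowScan, transpose, RowScan (the column scan),
    transpose back."""
    return row_scan(row_scan(img).T).T
```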

Slide 15

Kernel 1: Nearest Neighbor (NN) & RowScan (RS)
[Figure: 8x8 source image -> NN -> 4x4 downscaled image -> RS -> row-scanned image (0 16 48 96 / 2 20 54 104 / 4 24 60 112 / 6 28 66 120)]
Combining NN & RS eliminates 1 global memory access (no storing in between kernels)
Kernel configuration (w, h - width & height of the downscaled image):
Threads per block = smallestpower2(w) - constraint from the RowScan algorithm
Blocks = h
RowScan - inclusive prefix scan of each row in the image
Harris-Sengupta-Owen algorithm (upsweep & downsweep)
Shared memory [2 * BLOCKSIZE + 1]
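The upsweep/downsweep scan referred to above can be modeled serially (a sketch of the Harris-Sengupta-Owen work-efficient scheme, not the CUDA kernel itself; it assumes a power-of-two input length, which is why the kernel uses smallestpower2(w) threads):

```python
def inclusive_scan_upsweep_downsweep(a):
    """Serial model of the per-block work-efficient scan: an upsweep
    (reduce) phase builds partial sums, then a downsweep phase distributes
    them, yielding an exclusive scan that is shifted to inclusive."""
    x = list(a)
    n = len(x)                          # must be a power of two
    d = 1
    while d < n:                        # upsweep: build partial sums
        for i in range(d * 2 - 1, n, d * 2):
            x[i] += x[i - d]
        d *= 2
    x[n - 1] = 0                        # clear the root (exclusive scan)
    while d > 1:                        # downsweep: distribute the sums
        d //= 2
        for i in range(d * 2 - 1, n, d * 2):
            t = x[i - d]
            x[i - d] = x[i]
            x[i] += t
    # shift the exclusive result to get the inclusive scan
    return [e + v for e, v in zip(x, a)]
```

On the row (2, 18, 34, 50) from the example grid this produces (2, 20, 54, 104), matching the RowScan output shown above.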

Slide 16

Kernel 2, 3 & 4: ColumnScan (CS)
ColumnScan is replaced with kernels 2, 3 & 4
Why? It could have been done directly as a RowScan followed by a ColumnScan - the straightforward implementation:
[Figure: 4x4 downscaled image -> RS -> (0 16 48 96 / 2 20 54 104 / 4 24 60 112 / 6 28 66 120) -> CS -> integral image (0 16 48 96 / 2 36 102 200 / 6 60 162 312 / 12 88 228 422)]

Slide 17

Motivation for Transpose
Alternate method for ColumnScan: perform the prefix scan along each column directly
But then the global memory reads aren't coalesced
Time consuming (400 - 500 cycles each access!) & each thread needs a separate access
[Figure: global memory (GM) layout of the row-scanned 4x4 matrix - rows 0 to 3 are stored contiguously, so threads T0-T3 scanning down one column each stride across the layout]

Slide 18

ColumnScan Breakdown
[Figure: downscaled image after RowScan (0 16 48 96 / 2 20 54 104 / 4 24 60 112 / 6 28 66 120) -> Transpose -> (0 2 4 6 / 16 20 24 28 / 48 54 60 66 / 96 104 112 120) -> RS -> (0 2 6 12 / 16 36 60 88 / 48 102 162 228 / 96 200 312 422) -> Transpose -> image after integral sum (0 16 48 96 / 2 36 102 200 / 6 60 162 312 / 12 88 228 422)]
Matrix Transpose is implemented in a tiled fashion - coalesced reads & writes to GM
Further optimizations continued...

Slide 19

Optimizations in Transpose & RS Kernels
Matrix Transpose: GM coalescing - write row-wise only
[Figure: the naive way writes the transposed matrix column-wise; the optimized version stages a tile in shared memory [BLOCKSIZE][BLOCKSIZE] so global memory is both read and written row-wise]
Reading along a shared memory column - any SM bank conflicts?
There are 32 banks of SM; threads in a warp accessing the same bank conflict
BLOCKSIZE = 16; SM[Ty][Tx] changed to SM[Tx][Ty]
(Tx, Ty) maps to bank (Tx * 16 + Ty) % 32
(Tx, Ty) of (0, 0) & (2, 0) both map to bank 0
Eliminated by padding: Shared memory [BLOCKSIZE][BLOCKSIZE + 1]
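The padding trick can be checked numerically (a small model assuming 32 one-word banks, as on the GTX 480 generation of hardware):

```python
def bank(row, col, width):
    """Shared-memory bank holding element [row][col] of a row-major 2D
    array, assuming 32 banks of one word each."""
    return (row * width + col) % 32

# A column read from a 16-wide tile: 16 threads read rows 0..15 of column 0.
banks_16 = [bank(r, 0, 16) for r in range(16)]   # width 16 -> conflicts
banks_17 = [bank(r, 0, 17) for r in range(16)]   # width 16 + 1 -> padded

# With width 16, rows 0 and 2 both hit bank 0 (2 * 16 % 32 == 0); the whole
# column touches only 2 distinct banks. With the +1 padding, every row of
# the column lands in a distinct bank, so the read is conflict-free.
```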

Slide 20

Optimization in RowScan Only
RowScan - row-wise prefix scan
Use extern shared memory (SM) - don't make it an occupancy constraint by hardcoding
Kernel execution configuration: shared memory size required = 2 * (width of image) [one each for integral sum & square integral sum]
At a downscaled 256 x 256 image size (from a source of 1k x 1k pixels):
1D, 256 Threads Per Block (TPB), 256 blocks
6 blocks alive - 100% occupancy (1536 threads)
Hardcoding to the max case of 1K TPB (8 kB SM) decreases this to 5 blocks (only 84% occupancy)
Face detection using the Cascade Classifier (CC) follows...

Slide 21

Outline
Introduction
Motivation
Viola Jones Background
Nearest Neighbor and Integral Image
HAAR based Cascade Classifier Implementation
Evaluation and Results
Conclusion

Slide 22

Implementation: Scan Window Processing
Image size: 1024x1024 (can vary)
Classifier size: 25x25 (fixed)
Some specs of the classifiers:
Each HAAR classifier has up to 3 rectangles
Each stage consists of up to 211 HAAR classifiers
Our algorithm has 25 stages with 2913 HAAR classifiers

Slide 23

Implementation: Scan Window Processing
Image size: 1024x1024
Classifier size: 25x25
For each image, we consider a 25x25 moving scan window
The next scan window is obtained by shifting by one pixel (pixel++)
Each scan window is processed independently
For a 1024 x 1024 image, there are 1000 x 1000 = 10^6 scan windows

Slide 24

Baseline Implementation: Scan Window Processing
Each thread operates on one scan window
Each scan window is processed through all 25 stages to detect a face
A bit vector keeps track of rejected scan windows
The bit vector is copied back to host memory
[Figure: scan window (20,30) passes through Stage 1 (BV[20][30] = True), Stage 2 (BV[20][30] = False), ..., Stage 25 (BV[20][30] = True)]
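The baseline stage loop can be sketched on the host side (an illustration with stand-in stage predicates, not the actual Haar stage evaluation; `stages` here is a list of functions mapping a window to pass/fail):

```python
def run_cascade(windows, stages):
    """Baseline scan-window processing: every window is driven through the
    stages; a bit vector records which windows are still alive. A window
    rejected at some stage is skipped in all further stages."""
    alive = [True] * len(windows)
    for stage in stages:
        for i, w in enumerate(windows):
            if alive[i] and not stage(w):
                alive[i] = False      # rejected: skip in further stages
    return alive
```

On the GPU, one thread owns one window, and `alive` corresponds to the bit vector copied back to host memory.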

Slide 25

Optimizations (1): Scan Window Processing
Use shared memory: information on the classifiers is common to all threads:
Indices of the Haar classifiers
Weights of each rectangle in a Haar classifier
Threshold for each Haar classifier
Threshold for each stage
Bring these data into shared memory and share them across the thread block
Due to the shared memory limitation, the entire scan window processing is split into 12 kernels [total of 2913 classifiers across 12 kernels]
Each kernel uses approximately 19 kB of shared memory

Slide 26

Optimizations (2): Scan Window Processing
Use pinned host memory: replace malloc with cudaMallocHost
Use fast math: variance = sqrtf(square sum) - the special function unit in the GPU is used
Do not use maxrregcount: our kernel needs around 26 registers; if we restrict it to 20, spilling occurs
Occupancy is not always a measure of performance

Slide 27

Optimizations (3): Scan Window Processing
Use block divergence
If a scan window fails at the end of any kernel, reject it
This results in thread divergence
But more importantly, it leads to block divergence: thread blocks with all threads rejected won't be launched at all
This block divergence is the common case in most of the image windows
Hence, according to Amdahl's law, this optimization gave a huge performance benefit
[Figure: surviving windows from Kernel 1 proceed to Kernel 2; fully rejected blocks are not launched]

Slide 28

Outline
Introduction
Motivation
Viola Jones Background
Nearest Neighbor and Integral Image
HAAR based Cascade Classifier Implementation
Evaluation and Results
Conclusion

Slide 29

Evaluation
Used a GTX 480 GPU for evaluation: 15 SMs, 1.5 GB global memory
Shared memory usage: 8.2 kB for NN and II; 19 kB for the HAAR kernels
Occupancy: 100% for NN and II; 66.67% for the HAAR kernels
1024 x 1024 image size for evaluation - 21 downsampling iterations
Compared with single-threaded CPU performance

Slide 30

Performance of NN and II Kernels (1)

Slide 31

Performance of NN and II Kernels (2)

Slide 32

Performance of NN and II Kernels (3)

Slide 33

Performance of NN and II Kernels (4)

Slide 34

NN + II Overall Performance
Overall NN and II SpeedUp = 1.46x

Slide 35

Performance of HAAR Kernels

Slide 36

Performance of HAAR Kernels (2)
[Chart: per-kernel speedups - 29.7x, 82.7x, 128.8x, 155.9x, 161.1x, 135.1x, 137.8x, 139.2x, 144x, 221.3x, 212.7x]

Slide 37

Speed Up Over Iterations

Slide 38

Scanning Window Speed Up Comparison
Overall Scanning Window SpeedUp = 5.47x

Slide 39

Overall Face Detection Speed Up
Overall Face Detection SpeedUp = 5.35x

Slide 40

GPU Speed Up Over Varying Image Sizes

Slide 41

GPU Face Detection Accuracy

Faces             1     2     4     8     16    32
Detection Rate %  100   100   100   87.5  100   93.75

Average Detection Rate % = 96.875

Slide 42

Lessons Learned & Future Work
GPU provides performance and energy benefits over the CPU for parallelizable workloads
But this comes at a cost - need to understand the bottlenecks
Can reap benefits with finer-level optimizations
Future work:
Compare GPU performance with equivalent OpenMP and MPI code
The OpenCV library provides CUDA APIs for object detection - compare the performance of our implementation with these
Detection accuracy can be improved with a more robust version

Slide 43

Conclusion
Face detection is a good candidate for parallelization
Optimizations help in increasing GPU performance
Up to 5.3x performance improvement on the GPU
Further improvements possible with careful analysis and hardcoding