Implementation of a Wide - PowerPoint Presentation

Uploaded On 2018-02-04

Implementation of a Wide-Angle Lens Distortion Correction Algorithm on the Cell Broadband Engine. Konstantis Daloukas, Christos D. Antonopoulos, Nikolaos Bellas. Department of Computer and Communications Engineering.

Presentation Transcript

Slide1

Implementation of a Wide-Angle Lens Distortion Correction Algorithm on the Cell Broadband Engine

Konstantis Daloukas

Christos D. Antonopoulos

Nikolaos Bellas

Department of Computer and Communications Engineering

University of Thessaly

Volos, Greece

Slide2

June 9, 2009, ICS 2009

Introduction

Wide-angle lenses (a.k.a. fisheye lenses) are traditionally used to enlarge the field of view in photography

Conventional rectilinear lens

Full-frame fisheye lens: 98 degrees horizontal by 147 degrees vertical

Full circular fisheye lens: 180 degrees horizontal and vertical

Slide3

Introduction

Main Applications

Meteorology
Astronomy
Robot Navigation
Video Surveillance
Video Conferencing
Digital Cameras

The incoming rays are mapped onto a spherical surface

Such mapping introduces barrel distortion

Slide4

Motivation

Distortion must be corrected in real time
25-30 fps at VGA resolution for our application

Real-time distortion correction is not feasible with contemporary general-purpose processors
Core 2 Quad: 15.82 fps with SSE and 4 threads

Use a high-performance, non-conventional processor such as the Cell Broadband Engine (CBE)

Slide5

Outline

Introduction
The Cell Broadband Engine Architecture
Wide-angle Lenses Distortion Correction Algorithm
Mapping and Optimization Steps
Conclusion

Slide6

Cell BE Architecture

From: J. A. Kahle et al., "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, 49(4/5):589-604, July/September 2005.

Slide7

Cell BE Key Performance Characteristics

Peak Performance: 256 Gflops for single-precision FP arithmetic

Offers a rich repertoire for exploiting the various levels of parallelism
8 SPEs: Thread-Level Parallelism
SIMD Computational Engines: Data-Level Parallelism
Dual-Issue Pipeline: Instruction-Level Parallelism

Slide8

Outline

Introduction
The Cell Broadband Engine Architecture
Wide-angle Lenses Distortion Correction Algorithm
Mapping and Optimization Steps
Conclusion

Slide9

Wide-angle Lenses Distortion Correction

Transformation of the distorted wide-angle images back to the central perspective space.

Slide10

Projection Model of Wide-angle Lenses

[Figure panels: Wide-angle Projection; Central Perspective Projection]

Slide11

Algorithmic Flow (A)

Inverse Mapping: Maps each image point (i, j) to the corresponding point (x, y) in the wide-angle space

Slide12
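The inverse mapping step can be sketched as follows. This is only an illustration: the slides do not state which projection model the authors used, so the equidistant fisheye model (radius proportional to the ray angle), the function name `inverse_map`, and the focal-length parameters `f_persp` and `f_fish` are all assumptions. Here `i` is treated as the horizontal index and `j` as the vertical one, which is also an assumption.

```python
import math


def inverse_map(i, j, out_w, out_h, in_w, in_h, f_persp, f_fish):
    """Map output pixel (i, j) to fractional fisheye coordinates (x, y).

    Assumes an equidistant fisheye model (r = f_fish * theta); the slides
    do not name the projection model, so this is a hypothetical choice.
    """
    # Offsets from the centre of the central-perspective (output) image.
    u = i - (out_w - 1) / 2.0
    v = j - (out_h - 1) / 2.0
    r_persp = math.hypot(u, v)
    cx = (in_w - 1) / 2.0
    cy = (in_h - 1) / 2.0
    if r_persp == 0.0:
        # The optical axis maps to the centre of the fisheye image.
        return cx, cy
    # Angle between the incoming ray and the optical axis.
    theta = math.atan(r_persp / f_persp)
    # Radius at which the same ray lands in the fisheye image.
    r_fish = f_fish * theta
    scale = r_fish / r_persp
    return cx + u * scale, cy + v * scale


x, y = inverse_map(620, 240, 641, 481, 1001, 1001, 300.0, 300.0)
```

The returned coordinates are generally fractional, which is why an interpolation step is needed afterwards.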

Algorithmic Flow (A)

Need to approximate the value of fractional positions in the fisheye space
Complex, irregular memory access pattern

Slide13

Algorithmic Flow (B)

Bicubic Interpolation: uses a 4x4 window of pixels to approximate intermediate points

Slide14
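A minimal sketch of bicubic interpolation over a 4x4 window, done as four horizontal 1D passes followed by one vertical pass. The slides do not give the cubic kernel, so the Catmull-Rom coefficients below are an assumption, as are the function names and the reading of `s` and `t` as the horizontal and vertical fractional offsets.

```python
def cubic_1d(c0, c1, c2, c3, t):
    """Catmull-Rom cubic through four equally spaced samples.

    t in [0, 1) is the fractional position between c1 and c2.
    The kernel choice is an assumption, not taken from the slides.
    """
    a = -0.5 * c0 + 1.5 * c1 - 1.5 * c2 + 0.5 * c3
    b = c0 - 2.5 * c1 + 2.0 * c2 - 0.5 * c3
    c = -0.5 * c0 + 0.5 * c2
    return ((a * t + b) * t + c) * t + c1


def bicubic(image, x, y):
    """Interpolate image (a list of rows) at fractional position (x, y).

    Assumes the 4x4 window around (x, y) lies fully inside the image.
    """
    x0, y0 = int(x), int(y)
    s, t = x - x0, y - y0
    # Horizontal pass: one 1D interpolation per row of the 4x4 window.
    partial = []
    for dy in (-1, 0, 1, 2):
        row = image[y0 + dy]
        partial.append(
            cubic_1d(row[x0 - 1], row[x0], row[x0 + 1], row[x0 + 2], s)
        )
    # Vertical pass over the four horizontal results.
    return cubic_1d(partial[0], partial[1], partial[2], partial[3], t)


# A linear ramp is reproduced exactly by this kernel.
ramp = [[float(col + 10 * row) for col in range(6)] for row in range(6)]
value = bicubic(ramp, 2.25, 2.5)
```

Splitting the 2D interpolation into 1D passes keeps the inner computation to five evaluations of the same cubic per output value.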

Algorithmic Flow (B)

Bicubic interpolation is broken into horizontal and vertical 1D interpolation
Ci are the pixel values

[Figure labels: s, t]

Slide15

Complete Algorithm

For each pixel (i, j) in the central perspective space {
    Apply inverse mapping to find fractional coordinates (x, y) in the wide-angle space
    Use bicubic interpolation to approximate the pixel value at (x, y)
}
Apply a 2D low pass filter and downscale output image to VGA resolution (640x480)

Slide16

Outline

Introduction
The Cell Broadband Engine Architecture
Wide-angle Lenses Distortion Correction Algorithm
Mapping and Optimization Steps
Conclusion

Slide17

Block Tiling

Partition the output image into blocks and correct a block of pixels at a time

Slide18
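Block tiling of the output image can be sketched as below. The tile dimensions (64x32), the round-robin assignment to workers, and the helper names are hypothetical; the slides do not give the tile size or the scheduling policy the authors used.

```python
def tiles(width, height, tile_w, tile_h):
    """Yield (x0, y0, w, h) rectangles that cover a width x height image."""
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            # Edge tiles are clipped to the image boundary.
            yield x0, y0, min(tile_w, width - x0), min(tile_h, height - y0)


def assign_round_robin(tile_list, num_workers):
    """Split independent tiles among workers (for example, SPE threads)."""
    return [tile_list[w::num_workers] for w in range(num_workers)]


# Hypothetical 64x32 tiles over a VGA-sized output image, 8 workers.
all_tiles = list(tiles(640, 480, 64, 32))
per_worker = assign_round_robin(all_tiles, 8)
```

Because each output tile depends only on input pixels, the tiles can be processed in any order and by any worker.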

Block Tiling

Advantages of block tiling technique:
Maximizes data reuse
Facilitates the exploitation of the thread-level parallelism of the algorithm

Drawback:

Slide19

Performance after Block Tiling

Neither processor is capable of real-time execution

[Figure: performance chart; labels: 0.55 fps, 2.20 fps; 65 %, 48.9 %, 27 %, 12.7 %, 7 %, 38 %]

Slide20

Tile Size

Tile size and shape: very important parameters in explicitly blocked codes

Tile size must be large enough in order to:
Maximize data reuse and increase the working set
Minimize communication overhead

Tile size must be small due to:
The limited capacity of the local store (LS)
The curvature of input tiles

Cell BE imposes strict alignment requirements on DMA transfers
Additional limitations on the size and shape

Slide21
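The tile-size trade-off can be made concrete with a rough footprint check. The 256 KB local store per SPE and the 16-byte DMA granularity are properties of the Cell BE; everything else here is an assumption for illustration: 3 bytes per pixel, double buffering, 64 KB reserved for code and stack, and an input bounding box larger than the output tile to account for the curvature of input tiles.

```python
LOCAL_STORE_BYTES = 256 * 1024  # capacity of one SPE local store
DMA_ALIGN_BYTES = 16            # DMA rows padded to 16-byte multiples


def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def tile_fits(tile_w, tile_h, in_w, in_h,
              bytes_per_pixel=3, buffers=2, reserved_bytes=64 * 1024):
    """Return (fits, footprint_bytes) for one output tile.

    in_w x in_h is the bounding box of the input pixels the tile needs.
    bytes_per_pixel, buffers and reserved_bytes are hypothetical values.
    """
    out_row = round_up(tile_w * bytes_per_pixel, DMA_ALIGN_BYTES)
    in_row = round_up(in_w * bytes_per_pixel, DMA_ALIGN_BYTES)
    footprint = buffers * (out_row * tile_h + in_row * in_h)
    return footprint <= LOCAL_STORE_BYTES - reserved_bytes, footprint


small = tile_fits(64, 32, 80, 48)
large = tile_fits(256, 128, 320, 192)
```

Under these assumed numbers the small tile fits comfortably while the large one exceeds the budget, which is the tension between data reuse and local store capacity.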

Tile Size

Slide22

Thread Level Parallelism

Exploit thread-level parallelism
Tiles of the output images are independent
Offload the most time-consuming kernels to the SPEs

Slide23

Function Offloading

[Figure: frame-rate chart; labels: 0.55 fps, 1.19 fps, 15.82 fps]

Slide24

Vectorization

Utilize the SIMD computation capabilities of the SPEs

Accelerate computations by:
Clustering four FP operands in a vector
4x implicit loop unrolling

As an additional positive effect the branch misprediction penalty is reduced
Backward branches in loops are predicted as not taken
20 cycles misprediction penalty per branch eliminated

Slide25
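The effect of clustering four operands per vector can be modelled in plain Python. This is not SIMD code: a 4-element list stands in for one vector register, and the function names are hypothetical. The point it shows is that the loop trip count, and with it the number of backward branches, drops by a factor of four.

```python
def weighted_sum_scalar(pixels, weights):
    """One multiply-add and one loop trip per element."""
    total, trips = 0.0, 0
    for p, w in zip(pixels, weights):
        total += p * w
        trips += 1
    return total, trips


def weighted_sum_vec4(pixels, weights):
    """Four lanes per trip; a 4-element list models one vector register."""
    if len(pixels) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    acc = [0.0, 0.0, 0.0, 0.0]
    trips = 0
    for k in range(0, len(pixels), 4):
        # On SIMD hardware these four lane updates are one instruction.
        acc = [
            acc[0] + pixels[k] * weights[k],
            acc[1] + pixels[k + 1] * weights[k + 1],
            acc[2] + pixels[k + 2] * weights[k + 2],
            acc[3] + pixels[k + 3] * weights[k + 3],
        ]
        trips += 1
    # Horizontal reduction of the four lanes at the end.
    return acc[0] + acc[1] + acc[2] + acc[3], trips


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
half = [0.5] * 8
scalar_result = weighted_sum_scalar(data, half)
vector_result = weighted_sum_vec4(data, half)
```

Note that the lane-wise version adds the products in a different order, so for general floating-point inputs the two results may differ in the last bits.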

Vectorization

[Figure: frame-rate chart; labels: 0.55 fps, 1.19 fps, 10.75 fps, 15.82 fps]

Slide26

Color Loop Unrolling

The frames are in the (R, G, B) color space
Each doubly-nested loop contains an additional loop for the color components

Explicit 3x unrolling:
Furthers the positive effects of branch elimination
Increases the potential for efficient scheduling

Slide27
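Explicit 3x unrolling of the color loop can be illustrated with a toy per-pixel operation. The gain operation and the function names are hypothetical stand-ins for the real kernels; only the loop structure matters here.

```python
def scale_rolled(frame, gain):
    """Doubly-nested pixel loop with an inner loop over R, G, B."""
    out = []
    for row in frame:
        out_row = []
        for pixel in row:
            new_pixel = [0.0, 0.0, 0.0]
            for c in range(3):  # color loop: 3 trips, 3 loop branches
                new_pixel[c] = pixel[c] * gain
            out_row.append(tuple(new_pixel))
        out.append(out_row)
    return out


def scale_unrolled(frame, gain):
    """Same computation with the color loop unrolled by hand."""
    out = []
    for row in frame:
        out_row = []
        for r, g, b in row:
            # Three independent statements: no inner branch, and the
            # operations can be reordered freely by a scheduler.
            out_row.append((r * gain, g * gain, b * gain))
        out.append(out_row)
    return out


frame = [[(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]]
rolled = scale_rolled(frame, 2.0)
unrolled = scale_unrolled(frame, 2.0)
```

Both versions produce the same frame; the unrolled one simply exposes the three color computations as straight-line code.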

Color Loop Unrolling

[Figure: frame-rate chart; labels: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.82 fps]

Slide28

Unaligned Loads

Unaligned memory accesses due to the formation of the 4x4 window in bicubic interpolation
Pipeline stalls due to vector loads

[Figure: elements 1-16 of the 4x4 window arranged in vector registers r1-r4]

Slide29
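Why the 4x4 window leads to unaligned accesses can be shown with a small model. SPE loads operate on aligned 16-byte quadwords, so a window row that starts at an arbitrary column may straddle two quadwords. The sketch below models a quadword as four list elements and builds an unaligned row from two aligned loads; it illustrates the general technique and is not taken from the authors' code, and the function names are hypothetical.

```python
VECTOR_LANES = 4  # four 32-bit values per 16-byte quadword


def load_aligned(row, quad_index):
    """Model of an aligned vector load: returns one whole quadword."""
    start = quad_index * VECTOR_LANES
    return row[start:start + VECTOR_LANES]


def load_window_row(row, first):
    """Return row[first:first + 4] using only aligned loads.

    Assumes len(row) is a multiple of 4 and first + 4 <= len(row).
    """
    quad, offset = divmod(first, VECTOR_LANES)
    lo = load_aligned(row, quad)
    if offset == 0:
        return lo  # already aligned: one load is enough
    hi = load_aligned(row, quad + 1)
    # Two loads plus a shuffle-like selection of the wanted lanes.
    return (lo + hi)[offset:offset + VECTOR_LANES]


row = list(range(16))
aligned = load_window_row(row, 4)
straddling = load_window_row(row, 6)
```

The straddling case costs an extra load and a selection step per window row, which is one source of the pipeline stalls mentioned for this optimization.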

Unaligned Loads

[Figure: frame-rate chart; labels: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps, 15.82 fps]

Slide30

Manual Instruction Scheduling

The compiler proved too conservative when rescheduling independent instructions

Manually interleaved vector load instructions with computational operations
Reduced the remaining pipeline stalls

Manual scheduling is facilitated by the loop unrolling

The usage of the dual-issue pipeline increased from 22.6 % to 34.6 %

Slide31

Manual Instruction Scheduling

[Figure: frame-rate chart; labels: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps, 15.82 fps, 20 fps]

Slide32

Inverse Mapping Amortization

The inverse mapping kernel has to be executed only when the Field-of-View or Region-Of-Interest changes
These parameters change infrequently in a typical usage scenario

We evaluated the option of executing this kernel on the PPE (using the VMX/AltiVec extensions)
The coordinates are stored in main memory
Each SPE fetches the appropriate coordinates

The execution time of the algorithm decreased to:
0.045 sec./frame (about 22 fps) when 6 SPEs are used
0.033 sec./frame (about 30 fps) when 8 SPEs are used

Slide33

Outline

Introduction
The Cell Broadband Engine Architecture
Wide-angle Lenses Distortion Correction Algorithm
Mapping and Optimization Steps
Conclusions

Slide34

Conclusions

Outlined and evaluated the various optimizations needed to achieve real-time wide-angle lens distortion correction on the Cell BE
Most optimizations are applicable to many stencil computation algorithms
Counter-intuitive optimizations are highly unlikely to be made automatically

Commercially available general-purpose multi-cores are not capable of handling real-time distortion correction

More mature compiler technology is needed