
Slide1

Implementation of a Wide-Angle Lens Distortion Correction Algorithm on the Cell Broadband Engine

Konstantis Daloukas

Christos D. Antonopoulos

Nikolaos Bellas

Department of Computer and Communications Engineering

University of Thessaly

Volos, Greece

Slide2

June 9, 2009, ICS 2009

Introduction

Wide-angle lenses (a.k.a. fisheye lenses) are traditionally used to enlarge the field of view in photography.

Conventional rectilinear lens

Full-frame fisheye lens: 98 degrees horizontal by 147 degrees vertical

Full circular fisheye lens: 180 degrees horizontal and vertical

Slide3

Introduction

Main applications:
  Meteorology
  Astronomy
  Robot navigation
  Video surveillance
  Video conferencing
  Digital cameras

The incoming rays are mapped onto a spherical surface.

Such mapping introduces barrel distortion.

Slide4

Motivation

Distortion must be corrected in real time:
  25-30 fps at VGA resolution for our application
Real-time distortion correction is not feasible with contemporary general-purpose processors:
  Core 2 Quad: 15.82 fps with SSE and 4 threads
Use a high-performance, non-conventional processor such as the CBE
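The real-time requirement can be restated as a per-frame time budget. The short Python sketch below only does arithmetic on figures quoted on this slide (25-30 fps, VGA resolution, 15.82 fps). The three samples per pixel assume RGB frames, and the function names are illustrative.

```python
VGA_WIDTH, VGA_HEIGHT = 640, 480   # VGA resolution
COLOR_COMPONENTS = 3               # assumption: RGB frames


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def samples_per_second(fps):
    """Color samples that must be produced each second at VGA resolution."""
    return VGA_WIDTH * VGA_HEIGHT * COLOR_COMPONENTS * fps
```

At 25 fps the budget is 40 ms per frame. The Core 2 Quad figure of 15.82 fps corresponds to roughly 63 ms per frame, well over that budget.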

Slide5

Outline

  Introduction
  The Cell Broadband Engine Architecture
  Wide-angle Lenses Distortion Correction Algorithm
  Mapping and Optimization Steps
  Conclusion

Slide6

Cell BE Architecture

Figure from: J. A. Kahle et al., "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, 49(4/5):589-604, July/September 2005.

Slide7

Cell BE Key Performance Characteristics

Peak performance: 256 Gflops for single-precision FP arithmetic
Offers a rich repertoire for exploiting the various levels of parallelism:
  8 SPEs: thread-level parallelism
  SIMD computational engines: data-level parallelism
  Dual-issue pipeline: instruction-level parallelism

Slide8

Outline

  Introduction
  The Cell Broadband Engine Architecture
  Wide-angle Lenses Distortion Correction Algorithm
  Mapping and Optimization Steps
  Conclusion

Slide9

Wide-angle Lenses Distortion Correction

Transformation of the distorted wide-angle images back to the central perspective space.

Slide10

Projection Model of Wide-angle Lenses

[Figure: wide-angle projection compared with central perspective projection.]

Slide11

Algorithmic Flow (A)

Inverse mapping: maps each point (i, j) of the corrected (central perspective) image to the corresponding point (x, y) in the wide-angle space.

Slide12

Algorithmic Flow (A)

Need to approximate the value at fractional positions in the fisheye space
Complex, irregular memory access pattern
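The slides do not give the lens model, so the following Python sketch of inverse mapping rests on stated assumptions: i is the row and j the column, the corrected image follows a pinhole model with focal length `f_persp`, and the lens follows the equidistant model r = f * theta. All names are hypothetical. It shows why the result lands on fractional positions in the fisheye image.

```python
import math


def inverse_map(i, j, out_center, in_center, f_persp, f_fish):
    """Map output pixel (i, j) to fractional fisheye coordinates (x, y).

    Assumptions (not from the slides): i is the row, j the column, centers
    are (cx, cy) pairs, the output is a pinhole image with focal length
    f_persp, and the lens is equidistant: r = f_fish * theta.
    """
    u = j - out_center[0]            # horizontal offset from the optical axis
    v = i - out_center[1]            # vertical offset from the optical axis
    r_persp = math.hypot(u, v)       # radius in the central perspective image
    if r_persp == 0.0:
        return float(in_center[0]), float(in_center[1])  # axis maps to axis
    theta = math.atan2(r_persp, f_persp)   # angle of the incoming ray
    r_fish = f_fish * theta                # radius in the wide-angle image
    scale = r_fish / r_persp
    return in_center[0] + u * scale, in_center[1] + v * scale
```

With equal focal lengths, a point 300 pixels right of the center maps to about 235.6 pixels right of the fisheye center, which is not an integer position and therefore has to be interpolated.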

Slide13

Algorithmic Flow (B)

Bicubic interpolation: uses a 4x4 window of pixels to approximate intermediate points

Slide14

Algorithmic Flow (B)

Bicubic interpolation is broken into horizontal and vertical 1D interpolations
Ci are the pixel values

[Figure: interpolation diagram with labels s and t.]
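The interpolation formula itself did not survive in the slide text. As an illustration of the separable scheme, here is a minimal Python sketch that assumes the Catmull-Rom cubic kernel, which may differ from the kernel used in this work. Treating s as the horizontal and t as the vertical fractional offset is also an assumption.

```python
def cubic_weights(s):
    """Catmull-Rom weights for four samples at positions -1, 0, 1, 2.

    s in [0, 1) is the fractional offset from the second sample.
    """
    s2, s3 = s * s, s * s * s
    return (
        (-s3 + 2.0 * s2 - s) / 2.0,
        (3.0 * s3 - 5.0 * s2 + 2.0) / 2.0,
        (-3.0 * s3 + 4.0 * s2 + s) / 2.0,
        (s3 - s2) / 2.0,
    )


def interp_1d(c, s):
    """1D cubic interpolation over four pixel values c[0..3]."""
    return sum(w * ci for w, ci in zip(cubic_weights(s), c))


def bicubic(window, s, t):
    """Interpolate a 4x4 window: four horizontal passes, one vertical pass."""
    rows = [interp_1d(row, s) for row in window]   # horizontal 1D passes
    return interp_1d(rows, t)                      # vertical 1D pass
```

The separable form needs five 1D interpolations per output value, each a four-term weighted sum.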

Slide15

Complete Algorithm

For each pixel (i, j) in the central perspective space {
    Apply inverse mapping to find fractional coordinates (x, y) in the wide-angle space
    Use bicubic interpolation to approximate the pixel value at (x, y)
}
Apply a 2D low-pass filter and downscale the output image to VGA resolution (640x480)

Slide16

Outline

  Introduction
  The Cell Broadband Engine Architecture
  Wide-angle Lenses Distortion Correction Algorithm
  Mapping and Optimization Steps
  Conclusion

Slide17

Block Tiling

Partition the output image into blocks and correct a block of pixels at a time
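A minimal Python sketch of the partitioning step follows. The generator name and the tile sizes used in the example are hypothetical; the slides do not state the tile dimensions at this point.

```python
def tiles(width, height, tile_w, tile_h):
    """Yield (x0, y0, w, h) blocks covering a width x height image.

    Blocks on the right and bottom edges are clipped to the image.
    """
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            yield x0, y0, min(tile_w, width - x0), min(tile_h, height - y0)
```

For a 640x480 output, `list(tiles(640, 480, 64, 32))` produces 150 blocks that cover every pixel exactly once.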

Slide18

Block Tiling

Advantages of the block tiling technique:
  Maximizes data reuse
  Facilitates the exploitation of the thread-level parallelism of the algorithm
Drawback: [missing from the extracted slide text]

Slide19

Performance after Block Tiling

Neither processor is capable of real-time execution.

[Figure: performance chart. Frame rates shown: 0.55 fps and 2.20 fps. Percentage labels shown: 65 %, 48.9 %, 27 %, 12.7 %, 7 %, 38 %.]

Slide20

Tile Size

Tile size and shape: very important parameters in explicitly blocked codes
Tile size must be large enough in order to:
  Maximize data reuse and increase the working set
  Minimize communication overhead
Tile size must be small due to:
  The limited capacity of the local store (LS)
  The curvature of input tiles
Cell BE imposes strict alignment requirements on DMA transfers:
  Additional limitations on the size and shape
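These constraints can be checked with simple arithmetic before a tile size is chosen. The Python sketch below assumes a 256 KB local store per SPE and row transfers in multiples of 16 bytes. The pixel format, the input-tile bounding box and the bytes reserved for code and stack are hypothetical parameters, not values from the slides.

```python
LOCAL_STORE_BYTES = 256 * 1024   # SPE local store capacity
DMA_GRANULE_BYTES = 16           # assumed: rows moved in multiples of 16 bytes


def rows_are_dma_friendly(width_px, bytes_per_pixel):
    """True if one tile row is a whole number of 16-byte DMA granules."""
    return (width_px * bytes_per_pixel) % DMA_GRANULE_BYTES == 0


def fits_in_local_store(out_tile, in_tile, bytes_per_pixel, reserved_bytes):
    """True if an output tile plus its input tile fit next to code and stack.

    out_tile and in_tile are (width, height) pairs; in_tile is the bounding
    box of the curved input region that the output tile maps to.
    """
    out_bytes = out_tile[0] * out_tile[1] * bytes_per_pixel
    in_bytes = in_tile[0] * in_tile[1] * bytes_per_pixel
    return out_bytes + in_bytes + reserved_bytes <= LOCAL_STORE_BYTES
```

With 12 bytes per pixel (three single-precision components, an assumption), a 64x32 output tile and a 72x40 input box fit comfortably, while a 256x256 output tile alone exceeds the local store.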

Slide21

Tile Size

Slide22

Thread Level Parallelism

Exploit thread-level parallelism:
  Tiles of the output image are independent
  Offload the most time-consuming kernels to the SPEs
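Because the tiles are independent, they can be handed to workers in any order. The Python sketch below uses a thread pool as a stand-in for the SPEs; it does not use the Cell SDK, and `correct_tile` is a hypothetical placeholder kernel rather than the real correction code.

```python
from concurrent.futures import ThreadPoolExecutor


def correct_tile(tile):
    """Placeholder kernel: returns the tile and one dummy value per pixel."""
    x0, y0, w, h = tile
    pixels = [[(x0 + dx) * 1000 + (y0 + dy) for dx in range(w)]
              for dy in range(h)]
    return tile, pixels


def correct_image(width, height, tile_w, tile_h, workers):
    """Process all tiles with a pool of workers and assemble the output."""
    tile_list = [(x0, y0, min(tile_w, width - x0), min(tile_h, height - y0))
                 for y0 in range(0, height, tile_h)
                 for x0 in range(0, width, tile_w)]
    image = [[None] * width for _ in range(height)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No tile depends on another, so no synchronization is needed
        # beyond collecting the results.
        for (x0, y0, w, h), pixels in pool.map(correct_tile, tile_list):
            for dy in range(h):
                image[y0 + dy][x0:x0 + w] = pixels[dy]
    return image
```

The output is identical for any worker count, which is the property that makes the offloading safe.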

Slide23

Function Offloading

[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps and 15.82 fps.]

Slide24

Vectorization

Utilize the SIMD computation capabilities of the SPEs
Accelerate computations by:
  Clustering four FP operands in a vector: 4x implicit loop unrolling
As an additional positive effect, the branch misprediction penalty is reduced:
  Backward branches in loops are predicted as not taken
  20 cycles misprediction penalty per branch eliminated
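The implicit 4x unrolling can be illustrated without SIMD hardware. In the Python sketch below, plain lists stand in for vector registers, and the multiply-add operation is a hypothetical example rather than a kernel from this work. The point is only that grouping four operands cuts the number of loop iterations, and hence loop branches, by four.

```python
def scale_add_scalar(a, b, k):
    """One element per iteration: len(a) loop iterations (and branches)."""
    out, iterations = [], 0
    for x, y in zip(a, b):
        out.append(k * x + y)
        iterations += 1
    return out, iterations


def scale_add_vec4(a, b, k):
    """Four elements per iteration, as a 4-wide SIMD unit would process them.

    Assumes len(a) is a multiple of four.
    """
    out, iterations = [], 0
    for base in range(0, len(a), 4):
        va, vb = a[base:base + 4], b[base:base + 4]     # one vector load each
        out.extend(k * x + y for x, y in zip(va, vb))   # one vector multiply-add
        iterations += 1
    return out, iterations
```

For 16 elements the scalar loop iterates 16 times and the 4-wide loop 4 times, with identical results.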

Slide25

Vectorization

[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps and 15.82 fps.]

Slide26

Color Loop Unrolling

The frames are in the (R, G, B) color space
Each doubly-nested loop contains an additional loop for the color components
Explicit 3x unrolling:
  Furthers the positive effects of branch elimination
  Increases the potential for efficient scheduling
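The transformation can be shown on a toy kernel. In this Python sketch the per-component gain is a hypothetical stand-in for the real interpolation arithmetic; the two functions compute the same result, but the second has no inner loop over the color components.

```python
def apply_gain_rolled(frame, gain):
    """Inner loop over the three color components (one branch per component)."""
    out = []
    for row in frame:
        out_row = []
        for pixel in row:
            corrected = []
            for c in range(3):
                corrected.append(pixel[c] * gain)
            out_row.append(tuple(corrected))
        out.append(out_row)
    return out


def apply_gain_unrolled(frame, gain):
    """Same computation with the color loop unrolled 3x by hand."""
    out = []
    for row in frame:
        out_row = []
        for r, g, b in row:
            # Three independent statements: no loop branch, and more freedom
            # for a scheduler to interleave them.
            out_row.append((r * gain, g * gain, b * gain))
        out.append(out_row)
    return out
```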

Slide27

Color Loop Unrolling

[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps and 15.82 fps.]

Slide28

Unaligned Loads

Unaligned memory accesses due to the formation of the 4x4 window in bicubic interpolation
Pipeline stalls due to vector loads

[Figure: the 16 pixels of the 4x4 window (numbered 1-16) laid out across vector registers r1-r4.]
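To see why the 4x4 window causes unaligned accesses, assume that vector loads fetch only aligned quadwords of four elements. A window row that starts at an arbitrary pixel then straddles two quadwords. The Python sketch below emulates one common way to assemble such a row, two aligned loads followed by a selection; it is an illustration, not necessarily the approach taken in this work, and the function names are hypothetical.

```python
def aligned_load(memory, quad_index):
    """Load one aligned quadword: four consecutive elements starting at 4*q."""
    return memory[4 * quad_index:4 * quad_index + 4]


def unaligned_load(memory, start):
    """Emulate a four-element load at an arbitrary element offset.

    Assumes memory is padded so that the following quadword exists.
    """
    q, shift = divmod(start, 4)
    if shift == 0:
        return aligned_load(memory, q)          # already aligned: one load
    # Straddles two quadwords: load both, then select the wanted elements.
    both = aligned_load(memory, q) + aligned_load(memory, q + 1)
    return both[shift:shift + 4]
```

Each unaligned row costs two loads plus a selection step instead of a single load, so the four rows of a window multiply the load traffic that the pipeline has to wait on.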

Slide29

Unaligned Loads

[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps and 15.82 fps.]

Slide30

Manual Instruction Scheduling

The compiler proved too conservative in rescheduling independent instructions
Manually interleaved vector load instructions with computational operations:
  Reduced the remaining pipeline stalls
Manual scheduling is facilitated by the loop unrolling
The usage of the dual-issue pipeline increased from 22.6 % to 34.6 %

Slide31

Manual Instruction Scheduling

[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps, 15.82 fps and 20 fps.]

Slide32

Inverse Mapping Amortization

The inverse mapping kernel has to be executed only when the field of view or region of interest changes:
  These parameters change infrequently in a typical usage scenario
We evaluated the option of executing this kernel on the PPE (using the VMX/AltiVec extensions):
  The coordinates are stored in main memory
  Each SPE fetches the appropriate coordinates
The execution time of the algorithm decreased to:
  0.045 sec./frame when 6 SPEs are used
  0.033 sec./frame when 8 SPEs are used
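The amortization amounts to caching the coordinate table and recomputing it only when a parameter changes. The Python sketch below illustrates that idea under assumptions: the class, its method names and the `compute_table` callback are hypothetical, and the table layout is not taken from the slides. A small helper converts the reported times per frame into frame rates.

```python
class MappingCache:
    """Recompute the coordinate table only when the parameters change."""

    def __init__(self, compute_table):
        self._compute_table = compute_table   # hypothetical inverse-mapping kernel
        self._key = None
        self._table = None
        self.recomputations = 0

    def table_for(self, field_of_view, region_of_interest):
        key = (field_of_view, region_of_interest)
        if key != self._key:                  # parameters changed: rerun kernel
            self._table = self._compute_table(field_of_view, region_of_interest)
            self._key = key
            self.recomputations += 1
        return self._table


def frames_per_second(seconds_per_frame):
    """Convert a time per frame into a frame rate."""
    return 1.0 / seconds_per_frame
```

The reported 0.045 sec./frame corresponds to about 22.2 fps and 0.033 sec./frame to about 30.3 fps, so only the 8-SPE configuration reaches the 25-30 fps target stated in the Motivation slide.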

Slide33

Outline

  Introduction
  The Cell Broadband Engine Architecture
  Wide-angle Lenses Distortion Correction Algorithm
  Mapping and Optimization Steps
  Conclusions

Slide34

Conclusions

Outlined and evaluated the various optimizations needed to achieve real-time wide-angle lens distortion correction on the Cell BE:
  Most optimizations are applicable to many stencil computation algorithms
  Counter-intuitive optimizations are highly unlikely to be made automatically
Commercially available general-purpose multi-cores are not capable of handling real-time distortion correction
More mature compiler technology is needed

